Abstract

This paper investigates a new learning formulation called structured sparsity, which is a 
natural extension of the standard sparsity concept in statistical learning and compressive sensing. 
By allowing arbitrary structures on the feature set, this concept generalizes the group sparsity 
idea that has become popular in recent years. A general theory is developed for learning with 
structured sparsity, based on the notion of coding complexity associated with the structure. It is 
shown that if the coding complexity of the target signal is small, then one can achieve improved 
performance by using coding complexity regularization methods, which generalize the standard 
sparse regularization. Moreover, a structured greedy algorithm is proposed to efficiently solve 
the structured sparsity problem. It is shown that the greedy algorithm approximately solves 
the coding complexity optimization problem under appropriate conditions. Experiments are 
included to demonstrate the advantage of structured sparsity over standard sparsity on some 
real applications. 



1 Introduction 

We are interested in the sparse learning problem under the fixed design condition. Consider a
fixed set of $p$ basis vectors $\{x_1, \ldots, x_p\}$ where $x_j \in \mathbb{R}^n$ for each $j$. Here, $n$ is the sample size.
Denote by $X$ the $n \times p$ data matrix, with column $j$ of $X$ being $x_j$. Given a random observation
$y = [y_1, \ldots, y_n] \in \mathbb{R}^n$ that depends on an underlying coefficient vector $\beta \in \mathbb{R}^p$, we are interested in
the problem of estimating $\beta$ under the assumption that the target coefficient $\beta$ is sparse. Throughout
the paper, we consider fixed design only. That is, we assume $X$ is fixed, and randomization is with
respect to the noise in the observation $y$.

We consider the situation that Ey can be approximated by a sparse linear combination of the 
basis vectors: 

$$\mathbb{E}y \approx X\beta,$$

where we assume that $\beta$ is sparse. Define the support of a vector $\beta \in \mathbb{R}^p$ as

$$\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\},$$

and $\|\beta\|_0 = |\mathrm{supp}(\beta)|$. A natural method for sparse learning is $L_0$ regularization:



$$\hat\beta_{L_0} = \arg\min_{\beta \in \mathbb{R}^p} Q(\beta) \quad \text{subject to } \|\beta\|_0 \le s,$$



where s is the desired sparsity. For simplicity, unless otherwise stated, the objective function 
considered throughout this paper is the least squares loss 

$$Q(\beta) = \|X\beta - y\|_2^2,$$

although other objective functions for generalized linear models (such as logistic regression) can be 
similarly analyzed. 

Since this optimization problem is generally NP-hard, in practice, one often considers approximate
solutions. A standard approach is convex relaxation of $L_0$ regularization to $L_1$ regularization,
often referred to as Lasso [22]. Another commonly used approach is greedy algorithms, such as the
orthogonal matching pursuit (OMP) [22].
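
As a point of reference for the greedy approach just mentioned, the following is a minimal sketch of standard OMP for the least squares loss. It is an illustration only: the function name `omp`, the problem sizes, and the coefficient values are assumptions made for this example and do not come from the paper.

```python
import numpy as np

def omp(X, y, s):
    """Orthogonal matching pursuit: greedily build a support of size s."""
    n, p = X.shape
    support = []
    beta = np.zeros(p)
    residual = y.astype(float).copy()
    for _ in range(s):
        # Score each column by its absolute correlation with the residual.
        scores = np.abs(X.T @ residual)
        if support:
            scores[support] = -np.inf  # never reselect a chosen column
        support.append(int(np.argmax(scores)))
        # Refit least squares restricted to the current support.
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef
        residual = y - X @ beta
    return beta, sorted(support)

# Noiseless toy problem (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[[3, 50, 120]] = [2.0, -3.0, 1.5]
y = X @ beta_true
beta_hat, support = omp(X, y, s=3)
```

Each iteration adds one single feature; the structured greedy algorithm studied in this paper replaces that single feature by a block of features.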

In practical applications, one often knows a structure on the coefficient vector $\beta$ in addition
to sparsity. For example, in group sparsity, one assumes that variables in the same group tend 
to be zero or nonzero simultaneously. The purpose of this paper is to study the more general 
estimation problem under structured sparsity. If meaningful structures exist, we show that one can 
take advantage of such structures to improve the standard sparse learning. 

2 Related Work 

The idea of using structure in addition to sparsity has been explored before. An example is group
structure, which has received much attention recently. For example, group sparsity has been
considered for simultaneous sparse approximation [21] and multi-task compressive sensing [33] from the
Bayesian hierarchical modeling point of view. Under the Bayesian hierarchical model framework,
data from all sources contribute to the estimation of hyper-parameters in the sparse prior model.
The shared prior can then be inferred from multiple sources. He et al. recently extended the idea
to tree sparsity in the Bayesian framework [TTJ Q2]. Although the idea can be justified using
standard Bayesian intuition, there are no theoretical results showing how much better (and under
what kind of conditions) the resulting algorithms perform. In the statistical literature, Lasso has
been extended to the group Lasso when there exist group/block structured dependences among the
sparse coefficients [22].

However, none of the above-mentioned work was able to show the advantage of using group structure.
Although some theoretical results were developed in [TJ IE], neither showed that group Lasso is
superior to the standard Lasso. The authors of [33] showed that group Lasso can be superior
to standard Lasso when each group is an infinite dimensional kernel, by relying on the fact that
meaningful analysis can be obtained for kernel methods in infinite dimension. In [19], the authors
consider a special case of group Lasso in the multi-task learning scenario, and show that the number
of samples required for recovering the exact support set is smaller for group Lasso under appropriate
conditions. In [13], a theory for group Lasso was developed using a concept called strong group
sparsity, which is a special case of the general structured sparsity idea considered here. It was
shown in [13] that group Lasso is superior to standard Lasso for strongly group-sparse signals,
which provides a convincing theoretical justification for using group structured sparsity.

While group Lasso works under the strong group sparsity assumption, it doesn't handle the more
general structures considered in this paper. Several limitations of group Lasso were mentioned in
[13]. For example, group Lasso does not correctly handle overlapping groups (in that overlapping
components are over-counted); that is, a given coefficient should not belong to different groups.
This requirement is too rigid for many practical applications. To address this issue, a method
called composite absolute penalty (CAP) is proposed in [27] which can handle overlapping groups.
Unfortunately, no theory is established to demonstrate the effectiveness of the approach. In a related
development [16], Kowalski et al. generalized the mixed norm penalty to structured shrinkage, which
can identify the structured significance maps and thus can handle the case of overlapping groups.
However, the structured shrinkage operations do not necessarily converge to a fixed point. There
was no additional theory to justify their methods.

Other structures have also been explored in the literature. For example, so-called tonal and
transient structures were considered for sparse decomposition of audio signals in [5], but again
without any theory. Grimm et al. [10] investigated positive polynomials with structured sparsity
from an optimization perspective. The theoretical result there did not address the effectiveness of
such methods in comparison to standard sparsity. The closest work to ours is a recent paper [2],
which we learned of after finishing this paper. In that paper, a specific case of structured sparsity,
referred to as model based sparsity, was considered. It is important to note that some theoretical
results were obtained there to show the effectiveness of their method in compressive sensing, although
in a more limited scope than the results presented here. Moreover, they do not provide a generic
framework for structured sparsity. In their algorithm, different schemes have to be specifically
designed for different data models, and under specialized assumptions. It remains an open issue
how to develop a general theory for structured sparsity, together with a general algorithm that can
be applied to a wide class of such problems.

We see from the above discussion that there exists extensive literature on structured sparsity, 
with empirical evidence showing that one can achieve better performance by imposing additional 
structures. However, none of the previous work was able to establish a general theoretical framework 
for structured sparsity that can quantify its effectiveness. The goal of this paper is to develop such 
a general theory that addresses the following issues, where we pay special attention to the benefit 
of structured sparsity over the standard non-structured sparsity: 

• quantifying structured sparsity; 

• the minimal number of measurements required in compressive sensing; 

• estimation accuracy under stochastic noise; 

• an efficient algorithm that can solve a wide class of structured sparsity problems. 

3 Structured Sparsity 

In structured sparsity, not all sparse patterns are equally likely. For example, in group sparsity, 
coefficients within the same group are more likely to be zeros or nonzeros simultaneously. This means 
that if a sparse coefficient vector's support set is consistent with the underlying group structure, 
then it is more likely to occur, and hence incurs a smaller penalty in learning. One contribution 
of this work is to formulate how to define structure on top of sparsity, and how to penalize each 
sparsity pattern. 

In order to formalize the idea, we denote by $\mathcal{I} = \{1, \ldots, p\}$ the index set of the coefficients.
For any sparse subset $F \subset \{1, \ldots, p\}$, we assign a cost $\mathrm{cl}(F)$. In structured sparsity, the cost
of $F$ is an upper bound of the coding length of $F$ (number of bits needed to represent $F$ by a
computer program) in a pre-chosen prefix coding scheme. It is a well-known fact in information
theory (e.g. [7]) that mathematically, the existence of such a coding scheme is equivalent to

$$\sum_{F \subset \mathcal{I}} 2^{-\mathrm{cl}(F)} \le 1.$$

From the Bayesian statistics point of view, $2^{-\mathrm{cl}(F)}$ can be regarded as a lower bound of the
probability of $F$. The probability model of structured sparse learning is thus: first generate the sparsity
pattern $F$ according to probability $2^{-\mathrm{cl}(F)}$; then generate the coefficients in $F$.

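Definition 3.1 A cost function $\mathrm{cl}(F)$ defined on subsets of $\mathcal{I}$ is called a coding length (in base-2) if

$$\sum_{F \subset \mathcal{I},\, F \neq \emptyset} 2^{-\mathrm{cl}(F)} \le 1.$$

We give $\emptyset$ a coding length 0. The corresponding structured sparse coding complexity of $F$ is defined
as

$$c(F) = |F| + \mathrm{cl}(F).$$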

A coding length $\mathrm{cl}(F)$ is sub-additive if

$$\mathrm{cl}(F \cup F') \le \mathrm{cl}(F) + \mathrm{cl}(F'),$$

and a coding complexity $c(F)$ is sub-additive if

$$c(F \cup F') \le c(F) + c(F').$$

Clearly if $\mathrm{cl}(F)$ is sub-additive, then the corresponding coding complexity $c(F)$ is also
sub-additive. Based on the structured coding complexity of subsets of $\mathcal{I}$, we can now define the
structured coding complexity of a sparse coefficient vector $\beta \in \mathbb{R}^p$.

Definition 3.2 Given a coding complexity $c(F)$, the structured sparse coding complexity of a
coefficient vector $\beta \in \mathbb{R}^p$ is

$$c(\beta) = \min\{c(F) : \mathrm{supp}(\beta) \subset F\}.$$

Later in the paper, we will show that if a coefficient vector $\beta$ has a small coding complexity
$c(\beta)$, then $\beta$ can be effectively learned, with good in-sample prediction performance (in statistical
learning) and reconstruction performance (in compressive sensing). In order to see why the definition
requires adding $|F|$ to $\mathrm{cl}(F)$, we consider the generative model for structured sparsity mentioned
earlier. In this model, the number of bits to encode a sparse coefficient vector is the sum of the
number of bits to encode $F$ (which is $\mathrm{cl}(F)$) and the number of bits to encode nonzero coefficients
in $F$ (this requires $O(|F|)$ bits up to a fixed precision). Therefore the total number of bits required
is $\mathrm{cl}(F) + O(|F|)$. This information theoretical result translates into a statistical estimation result:
without additional regularization, the learning complexity for least squares regression within any
fixed support set $F$ is $O(|F|)$. By adding the model selection complexity $\mathrm{cl}(F)$ for each support set
$F$, we obtain an overall statistical estimation complexity of $O(\mathrm{cl}(F) + |F|)$.
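
To make the definitions concrete, the sketch below checks the coding-length condition by brute force for a small index set and evaluates $c(F) = |F| + \mathrm{cl}(F)$. The coding length used, $|F| \log_2(2p)$, is the standard-sparsity scheme discussed later in the paper; the helper names and the choice $p = 8$ are assumptions made for this illustration.

```python
import itertools
import math

def standard_cl(F, p):
    """Standard-sparsity coding length: |F| * log2(2p) bits."""
    return len(F) * math.log2(2 * p)

def coding_complexity(F, cl):
    """Structured sparse coding complexity c(F) = |F| + cl(F)."""
    return len(F) + cl(F)

def kraft_sum(p, cl):
    """Sum of 2^{-cl(F)} over all nonempty subsets F of {0, ..., p-1}."""
    total = 0.0
    for k in range(1, p + 1):
        for F in itertools.combinations(range(p), k):
            total += 2.0 ** (-cl(F))
    return total

p = 8  # small enough to enumerate all 2^p - 1 nonempty subsets
ks = kraft_sum(p, lambda F: standard_cl(F, p))
c_example = coding_complexity((1, 2, 5), lambda F: standard_cl(F, p))
```

For this scheme the sum has the closed form $(1 + 1/(2p))^p - 1$, which stays below 1 for every $p$, so the condition of Definition 3.1 holds.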

While the idea of using coding based penalization is clearly motivated by the minimum
description length (MDL) principle, the actual penalty we obtain for structured sparsity problems
is different from the standard MDL penalty for model selection. This difference is important in 
sparse learning. Therefore in order to prevent confusion, we avoid using MDL in our terminology. 
Nevertheless, one may consider our framework as a natural combination of the MDL idea and the 
modern sparsity analysis. 






4 Structured Sparsity Examples 



Before giving detailed examples, we introduce a general coding scheme called block coding. The
basic idea of block coding is to define a coding scheme on a small number of base blocks (a block is
a subset of $\mathcal{I}$), and then define a coding scheme on all subsets of $\mathcal{I}$ using these base blocks.

Consider a subset $\mathcal{B} \subset 2^{\mathcal{I}}$. That is, each element (a block) of $\mathcal{B}$ is a subset of $\mathcal{I}$. We call $\mathcal{B}$
a block set if $\mathcal{I} = \bigcup_{B \in \mathcal{B}} B$ and all single element sets $\{j\}$ belong to $\mathcal{B}$ ($j \in \mathcal{I}$). Note that $\mathcal{B}$ may
contain additional non single-element blocks. The requirement of $\mathcal{B}$ containing all single element
sets is for convenience, as it implies that every subset $F \subset \mathcal{I}$ can be expressed as the union of blocks
in $\mathcal{B}$.

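Let $\mathrm{cl}_0$ be a code length on $\mathcal{B}$:

$$\sum_{B \in \mathcal{B}} 2^{-\mathrm{cl}_0(B)} \le 1,$$

we define $\mathrm{cl}(B) = \mathrm{cl}_0(B) + 1$ for $B \in \mathcal{B}$. It is not difficult to show that the following cost function on
$F \subset \mathcal{I}$ is a coding length

$$\mathrm{cl}(F) = \min\left\{ \sum_{j=1}^{b} \mathrm{cl}(B_j) : F = \bigcup_{j=1}^{b} B_j \ (B_j \in \mathcal{B}) \right\}.$$

This is because

$$\sum_{F \subset \mathcal{I},\, F \neq \emptyset} 2^{-\mathrm{cl}(F)} \le \sum_{b \ge 1} \sum_{\{B_\ell\} \in \mathcal{B}^b} 2^{-\sum_{\ell=1}^{b} \mathrm{cl}(B_\ell)} \le \sum_{b \ge 1} \prod_{\ell=1}^{b} \sum_{B_\ell \in \mathcal{B}} 2^{-\mathrm{cl}(B_\ell)} \le \sum_{b \ge 1} 2^{-b} = 1.$$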

It is clear from the definition that block coding is sub-additive. 
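
The block coding length can be evaluated exactly for small sets by a dynamic program over subsets of $F$. The sketch below is one possible way to do this, not the paper's procedure: the function name, the bitmask representation, and the toy block set (8 singletons plus 2 groups of size 4) are assumptions of this example.

```python
import math

def block_coding_length(F, blocks):
    """cl(F) = min sum of cl(B_j) over exact covers F = union of B_j.

    blocks maps frozenset -> cl0(B); each block costs cl(B) = cl0(B) + 1.
    Only blocks contained in F can appear in a cover of F.
    """
    items = sorted(set(F))
    index = {j: i for i, j in enumerate(items)}
    usable = []
    for B, cl0 in blocks.items():
        if B <= set(items) and math.isfinite(cl0):
            mask = sum(1 << index[j] for j in B)
            usable.append((mask, cl0 + 1.0))
    full = (1 << len(items)) - 1
    best = [math.inf] * (full + 1)
    best[0] = 0.0  # the empty set has coding length 0
    for m in range(full + 1):  # m | mask >= m, so ascending order suffices
        if math.isinf(best[m]):
            continue
        for mask, cost in usable:
            best[m | mask] = min(best[m | mask], best[m] + cost)
    return best[full]

# Toy block set: p = 8 singletons (cl0 = log2(2p)) and m = 2 groups (cl0 = log2(2m)).
p, m = 8, 2
blocks = {frozenset({j}): math.log2(2 * p) for j in range(p)}
blocks[frozenset(range(0, 4))] = math.log2(2 * m)
blocks[frozenset(range(4, 8))] = math.log2(2 * m)
kraft = sum(2.0 ** (-v) for v in blocks.values())
```

With these hypothetical blocks, a whole group costs 3 bits, whereas covering the same four indices with singletons would cost 20 bits, which is the kind of saving block coding is meant to capture.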

The main purpose of introducing block coding is to design computationally efficient algorithms
based on the block structure. In particular, this paper considers a structured greedy algorithm that
can take advantage of block structures. In the structured greedy algorithm, instead of searching over
all subsets of $\mathcal{I}$ up to a fixed coding complexity $s$ (the number of such subsets can be exponential
in $s$), we greedily add blocks from $\mathcal{B}$ one at a time. Each search problem over $\mathcal{B}$ can be efficiently
performed because we require that $\mathcal{B}$ contains only a computationally manageable number of base
blocks. Therefore the algorithm is computationally efficient.

We will show that under appropriate conditions, a target coefficient vector with a small block 
coding complexity can be approximately learned using the structured greedy algorithm. This means 
that the block coding scheme has important algorithmic implications. That is, if a coding scheme 
can be approximated by block coding with a small number of base blocks, then the corresponding 
estimation problem can be approximately solved using the structured greedy algorithm. For this 
reason, we shall pay special attention to block coding approximation schemes for examples discussed 
below. 



Standard sparsity 

A simple coding scheme is to code each subset $F \subset \mathcal{I}$ of cardinality $k$ using $k \log_2(2p)$ bits, which
corresponds to block coding with $\mathcal{B}$ consisting only of single element sets, and each base block has
a coding length $\mathrm{cl}_0 = \log_2 p$. This corresponds to the complexity for the standard sparse learning.

A more general version is to consider single element blocks $\mathcal{B} = \{\{j\} : j \in \mathcal{I}\}$, with a
non-uniform coding scheme $\mathrm{cl}_0(\{j\}) = c_j$, such that $\sum_j 2^{-c_j} \le 1$. It leads to a non-uniform coding
length on $\mathcal{I}$ as

$$\mathrm{cl}(B) = |B| + \sum_{j \in B} c_j.$$

In particular, if a feature j is likely to be nonzero, we should give it a smaller coding length Cj, and 
if a feature j is likely to be zero, we should give it a larger coding length. 

Group sparsity 

The concept of group sparsity has appeared in various recent work, such as the group Lasso in [25].
Consider a partition of $\mathcal{I} = \bigcup_{j=1}^{m} G_j$ into $m$ disjoint groups. Let $\mathcal{B}_G$ contain the $m$ groups $\{G_j\}$,
and $\mathcal{B}_1$ contain $p$ single element blocks. The strong group sparsity coding scheme is to give each
element in $\mathcal{B}_1$ a code-length $\mathrm{cl}_0$ of $\infty$, and each element in $\mathcal{B}_G$ a code-length $\mathrm{cl}_0$ of $\log_2 m$. Then the
block coding scheme with blocks $\mathcal{B} = \mathcal{B}_G \cup \mathcal{B}_1$ leads to group sparsity, which only looks for signals
consisting of the groups. The resulting coding length is: $\mathrm{cl}(B) = g \log_2(2m)$ if $B$ can be represented
as the union of $g$ disjoint groups $G_j$; and $\mathrm{cl}(B) = \infty$ otherwise.

Note that if the signal can be expressed as the union of $g$ groups, and each group size is $k_0$, then
the group coding length $g \log_2(2m)$ can be significantly smaller than the standard sparsity coding
length of $g k_0 \log_2(p)$. As we shall see later, the smaller coding complexity implies better learning
behavior, which is essentially the advantage of using group sparse structure. It was shown in [13]
that strong group sparsity defined above also characterizes the performance of group Lasso. Therefore
if a signal has a pre-determined group structure, then group Lasso is superior to standard Lasso.
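
A quick numerical comparison of the two coding lengths quoted above may help fix orders of magnitude. The sizes below ($p = 1024$, groups of size $k_0 = 16$, $g = 4$ active groups) are hypothetical values chosen for this illustration.

```python
import math

def group_coding_length(g, m):
    """Coding length of a union of g groups out of m: g * log2(2m)."""
    return g * math.log2(2 * m)

def standard_coding_length(k, p):
    """Standard-sparsity figure quoted in the text: k * log2(p)."""
    return k * math.log2(p)

p, k0, g = 1024, 16, 4      # hypothetical sizes
m = p // k0                 # number of groups
group_bits = group_coding_length(g, m)
standard_bits = standard_coding_length(g * k0, p)
```

In this setting the group code needs 28 bits against 640 bits for the unstructured code, a gap that grows with the group size $k_0$.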

An extension of this idea is to allow more general block coding length for $\mathrm{cl}_0(G_j)$ and $\mathrm{cl}_0(\{j\})$
so that

$$\sum_{j=1}^{m} 2^{-\mathrm{cl}_0(G_j)} + \sum_{j=1}^{p} 2^{-\mathrm{cl}_0(\{j\})} \le 1.$$

This leads to non-uniform coding of the groups, so that a group that is more likely to be nonzero is 
given a smaller coding length. 

Hierarchical sparsity 

One may also create a hierarchical group structure. A simple example is wavelet coefficients of a
signal [17]. Another simple example is a binary tree with the variables as leaves, which we describe
below. Each internal node in the tree is associated with three options: left child only, right child
only, and both children; each option can be encoded in $\log_2 3$ bits.

Given a subset $F \subset \mathcal{I}$, we can go down from the root of the tree, and at each node, decide
whether only the left child contains elements of $F$, or only the right child contains elements of $F$, or
both children contain elements of $F$. Therefore the coding length of $F$ is $\log_2 3$ times the total
number of internal nodes leading to elements of $F$. Since each leaf corresponds to no more than
$\log_2 p$ internal nodes, the total coding length is no worse than $\log_2 3 \cdot \log_2 p \cdot |F|$. However, the coding
length can be significantly smaller if nodes are close to each other or are clustered. In the extreme
case, when the nodes are consecutive, we have $O(|F| + \log_2 p)$ coding length. More generally,
if we can order elements in $F$ as $F = \{j_1, \ldots, j_q\}$, then the coding length can be bounded as
$\mathrm{cl}(F) = O\big(|F| + \log_2 p + \sum_{s=2}^{q} \log_2 \min_{\ell < s} |j_s - j_\ell|\big)$.
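
The tree coding length described above can be computed by counting the internal nodes that lie on some root-to-leaf path ending in $F$. The sketch below assumes a complete binary tree with $p = 2^d$ leaves stored in heap order (root at index 1, leaf $j$ at index $p + j$); this layout and the function names are assumptions of the example.

```python
import math

def internal_nodes_on_paths(F, p):
    """Count internal nodes on root-to-leaf paths to the leaves in F."""
    assert p >= 2 and p & (p - 1) == 0, "p must be a power of two"
    visited = set()
    for j in F:
        node = (p + j) // 2  # parent of leaf j (leaves occupy p..2p-1)
        # Stop early: once a node is visited, so are all of its ancestors.
        while node >= 1 and node not in visited:
            visited.add(node)
            node //= 2
    return len(visited)

def tree_coding_length(F, p):
    """log2(3) bits for each internal node leading to an element of F."""
    return math.log2(3) * internal_nodes_on_paths(F, p)

p = 1024
clustered = range(16)                 # 16 consecutive leaves
scattered = range(0, 1024, 64)        # 16 leaves spread across the tree
n_clustered = internal_nodes_on_paths(clustered, p)
n_scattered = internal_nodes_on_paths(scattered, p)
```

The clustered set touches far fewer internal nodes than the scattered set of the same size, and both stay below the worst-case count $|F| \log_2 p = 160$.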
If all internal nodes of the tree are also variables in $\mathcal{I}$ (for example, in the case of wavelet
decomposition), then one may consider feature set $F$ with the following property: if a node is
selected, then its parent is also selected. This requirement is very effective in wavelet compression,
and often referred to as the zero-tree structure [21]. Similar requirements have also been applied in
statistics [27] for variable selection. The argument presented in this section shows that if we require
$F$ to satisfy the zero-tree structure, then its coding length is at most $O(|F|)$, without any explicit
dependency on the dimensionality $p$. This is because one does not have to reach a leaf node.

The tree-based coding scheme discussed in this section can be approximated by block coding
using no more than $p^{1+\delta}$ base blocks ($\delta > 0$). The idea is similar to that of the image coding
example in the more general graph sparsity scheme which we discuss next.

Graph sparsity 

We consider a generalization of the hierarchical and group sparsity idea that employs a (directed or
undirected) graph structure $G$ on $\mathcal{I}$. To the best of our knowledge, this general structure has not
been considered in any previous work.

In graph sparsity, each variable (an element of $\mathcal{I}$) is a node of $G$ but $G$ may also contain
additional nodes that are not variables. For simplicity, we assume $G$ contains a starting node (this
requirement is not critical).

At each node $v \in G$, we define coding length $\mathrm{cl}_v(S)$ on all subsets $S$ of the neighborhood $N_v$ of
$v$ including the empty set, as well as any other single node $u \in G$ with $\mathrm{cl}_v(u)$, such that

$$\sum_{S \subset N_v} 2^{-\mathrm{cl}_v(S)} + \sum_{u \in G} 2^{-\mathrm{cl}_v(u)} \le 1.$$

To encode $F \subset G$, we start with the active set containing only the starting node, and finish when
the set becomes empty. At each node $v$ before termination, we may either pick a subset $S \subset N_v$,
with coding length $\mathrm{cl}_v(S)$, or a node $u \in G$, with coding length $\mathrm{cl}_v(u)$, and then put the selection
into the active set. We then remove $v$ from the active set (once a node $v$ is removed, it does not
return to the active set anymore). This process is continued until the active set becomes empty.

As a concrete example, we consider image processing, where each image is a rectangle of pixels
(nodes); each pixel is connected to four adjacent pixels, which forms the underlying graph structure.
At each pixel, the number of subsets in its neighborhood is $2^4 = 16$ (including the empty set), and
each subset is given a coding length $\mathrm{cl}_v(S) = 5$; we also encode all other pixels in the image
with random jumping, each with a coding length $1 + \log_2 p$. Using this scheme, we can encode
each connected region $F$ by no more than $\log_2 p + 5|F|$ bits by growing the region from a single
point in the region. Therefore if $F$ is composed of $g$ connected regions, then the coding length is
$g \log_2 p + 5|F|$, which can be significantly better than the standard sparse coding length of $|F| \log_2 p$.
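
The image example can be reproduced numerically by counting 4-connected regions and plugging the count into $g \log_2 p + 5|F|$. The sketch below is an illustration under assumed inputs: pixels are given as (row, column) pairs, the image is taken to be $64 \times 64$, and the two square regions are made up for the example.

```python
import math

def connected_regions(F):
    """Number of 4-connected components of a set of (row, col) pixels."""
    F = set(F)
    seen = set()
    g = 0
    for start in F:
        if start in seen:
            continue
        g += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first flood fill of one region
            r, c = stack.pop()
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in F and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return g

def graph_coding_length(F, p):
    """g * log2(p) + 5 * |F| for F made of g connected regions."""
    return connected_regions(F) * math.log2(p) + 5 * len(set(F))

p = 64 * 64  # hypothetical 64 x 64 image
square_a = {(r, c) for r in range(5) for c in range(5)}
square_b = {(r, c) for r in range(20, 25) for c in range(30, 35)}
F = square_a | square_b
graph_bits = graph_coding_length(F, p)
standard_bits = len(F) * math.log2(p)
```

Note that the graph code only wins when $\log_2 p$ exceeds the per-pixel cost of 5 bits, that is, for images with more than 32 pixels.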

This example shows that the general graph coding scheme presented here favors connected 
regions (that is, nodes that are grouped together with respect to the graph structure). In particular, 
it proves the following more general result. 

Proposition 4.1 Given a graph $G$, there exists a constant $C_G$ such that for any probability
distribution $q$ on $G$ ($\sum_{v \in G} q(v) = 1$ and $q(v) \ge 0$ for $v \in G$), the following quantity is a coding length

$$\mathrm{cl}(F) = C_G |F| - \sum_{j=1}^{g} \max_{v \in F_j} \log_2 q(v),$$

where $F \in 2^G$ can be decomposed into the union of $g$ connected components $F = \bigcup_{j=1}^{g} F_j$.

Our simple and suboptimal coding scheme described for the image processing example gives an
upper bound $C_G \le 1 + d_G$, where $d_G$ is the maximum degree of $G$. However, for many graphs, one
can improve this constant to $O(\log d_G)$ with a slightly more complicated argument.

As a simple application of graph coding, we consider the special case where we have only one
connected component that contains the starting node $v_0$. We can simply let $q(v_0) = 1$, and the
coding length is $O(|F|)$, which is independent of the dimensionality $p$. This generalizes the similar
claim for the zero-tree structure described earlier.

The graph coding scheme can be approximated with block coding. The idea is to consider
relatively small sized base blocks consisting of nodes that are close together with respect to the
graph structure, and then use the induced block coding scheme to approximate the graph coding.

For example, for the previously discussed image coding example, we can use connected blocks
of size up to $\delta \log_2 p / 5$ as base blocks of $\mathcal{B}$ ($\delta > 0$). Since each base block can be encoded with
$(1 + \delta) \log_2 p$ bits by the earlier discussion, we know that the total number of base blocks can be no
more than $p^{1+\delta}$. We can give each such block a coding length $(1 + \delta) \log_2 p$. For a connected
region $F$ that can be covered by $O(1 + |F| / \log_2 p)$ such blocks, the corresponding block coding
length is $\mathrm{cl}(F) = O(|F| + \log_2 p)$, which is the same as the complexity of the original
graph coding length (up to a constant). This means that graph coding length can be approximated
with a block coding scheme. As we have pointed out, such an approximation is useful because the
latter is required in the structured greedy algorithm which we propose in this paper.

Random field sparsity 

Let $z_j \in \{0, 1\}$ be a random variable for $j \in \mathcal{I}$ that indicates whether $j$ is selected or not. The most
general coding scheme is to consider a joint probability distribution of $z = [z_1, \ldots, z_p]$. The coding
length for $F$ can be defined as $-\log_2 p(z_1, \ldots, z_p)$ with $z_j = I(j \in F)$ indicating whether $j \in F$ or
not.

Such a probability distribution can often be conveniently represented as a binary random field
on an underlying graph. In order to encourage sparsity, on average, the marginal probability
$p(z_j)$ should take 1 with probability close to $O(1/p)$, so that the expected number of $j$'s with
$z_j = 1$ is $O(1)$. For disconnected graphs ($z_j$ are independent), the variables $z_j$ are iid Bernoulli
random variables with probability $1/p$ being one. In this case, the coding length of a set $F$ is
$|F| \log_2(p) - (p - |F|) \log_2(1 - 1/p) \approx |F| \log_2(p) + 1$. This is essentially the probability model for
the standard sparsity scheme.

In many cases, it is possible to approximate a general random field coding scheme with block 
coding by using approximation methods in the graphical model literature. However, the details of 
such approximations are beyond the scope of this paper. 

5 Algorithms for Structured Sparsity 

The following algorithm is a natural extension of $L_0$ regularization to structured sparsity problems.
It penalizes the coding complexity instead of the cardinality (sparsity) of the feature set.

$$\hat\beta_{constr} = \arg\min_{\beta \in \mathbb{R}^p} Q(\beta) \quad \text{subject to } c(\beta) \le s. \qquad (1)$$

Alternatively, we may consider the formulation

$$\hat\beta_{pen} = \arg\min_{\beta \in \mathbb{R}^p} \left[ Q(\beta) + \lambda c(\beta) \right]. \qquad (2)$$



The optimization of either (1) or (2) is generally hard. For related problems, there are two
common approaches to alleviate this difficulty. One is convex relaxation ($L_1$ regularization to
replace $L_0$ regularization for standard sparsity); the other is forward greedy selection (also called
orthogonal matching pursuit or OMP). We do not know any extensions of $L_1$ regularization like
convex relaxation that can handle general structured sparsity formulations. However, one can
extend the greedy algorithm by using a block structure. We call the resulting procedure the structured
greedy algorithm or StructOMP, which can approximately solve (1).

We have discussed the relationship of this greedy algorithm and block coding in Section 4. It is
important to understand that the block structure is only used to limit the search space in the greedy
algorithm. The actual coding scheme does not have to be the corresponding block coding. However,
our theoretical analysis assumes that the underlying coding scheme can be approximated with block
coding using base blocks employed in the greedy algorithm. Although one does not need to know
the specific approximation in order to use the greedy algorithm, knowing its existence (which can
be shown for the examples discussed in Section 4) guarantees the effectiveness of the algorithm. It
is also useful to understand that our result does not imply that the algorithm won't be effective if
the actual coding scheme cannot be approximated by block coding.



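Input: $(X, y)$, $\mathcal{B} \subset 2^{\mathcal{I}}$, $s > 0$
Output: $F^{(k)}$ and $\beta^{(k)}$
let $F^{(0)} = \emptyset$ and $\beta^{(0)} = 0$
for $k = 1, 2, \ldots$
    select $B^{(k)} \in \mathcal{B}$ to maximize progress (*)
    let $F^{(k)} = B^{(k)} \cup F^{(k-1)}$
    let $\beta^{(k)} = \arg\min_{\beta \in \mathbb{R}^p} Q(\beta)$ subject to $\mathrm{supp}(\beta) \subset F^{(k)}$
    if ($c(\beta^{(k)}) > s$) break
end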



Figure 1: Structured Greedy Algorithm 

In Figure 1, we are given a set of blocks $\mathcal{B}$ that contains subsets of $\mathcal{I}$. Instead of searching all
subsets $F \subset \mathcal{I}$ up to a certain complexity $|F| + \mathrm{cl}(F)$, which is computationally infeasible, we search
only the blocks restricted to $\mathcal{B}$. It is assumed that searching over $\mathcal{B}$ is computationally manageable.

At each step (*), we try to find a block from $\mathcal{B}$ to maximize progress. It is thus necessary to
define a quantity that measures progress. Our idea is to approximately maximize the gain ratio:

$$\lambda^{(k)} = \frac{Q(\beta^{(k-1)}) - Q(\beta^{(k)})}{c(\beta^{(k)}) - c(\beta^{(k-1)})},$$

which measures the reduction of objective function per unit increase of coding complexity. This
greedy criterion is a natural generalization of the standard greedy algorithm, and essential in our
analysis. For least squares regression, we can approximate $\lambda^{(k)}$ using the following definition

$$\phi(B) = \frac{\|P_{B - F^{(k-1)}}(X\beta^{(k-1)} - y)\|_2^2}{c(B \cup F^{(k-1)}) - c(F^{(k-1)})}, \qquad (3)$$

where

$$P_F = X_F (X_F^\top X_F)^{-1} X_F^\top$$

is the projection matrix to the subspace generated by the columns of $X_F$. We then select $B^{(k)}$ so that

$$\phi(B^{(k)}) \ge \gamma \max_{B \in \mathcal{B}} \phi(B),$$

where $\gamma \in (0, 1]$ is a fixed approximation ratio that specifies the quality of approximate optimization.
Alternatively, we may use a simpler definition

$$\tilde\phi(B) = \frac{\|X_{B - F^{(k-1)}}^\top (X\beta^{(k-1)} - y)\|_2^2}{c(B \cup F^{(k-1)}) - c(F^{(k-1)})},$$

which is easier to compute, especially when blocks are overlapping. Since the ratio

$$\|X_{B - F^{(k-1)}}^\top (X\beta^{(k-1)} - y)\|_2^2 \,/\, \|P_{B - F^{(k-1)}}(X\beta^{(k-1)} - y)\|_2^2$$

is bounded between $\rho_+(B)$ and $\rho_-(B)$ (these quantities are defined in Definition 6.1), we know that
maximizing $\tilde\phi(B)$ would lead to approximate maximization of $\phi(B)$ with $\gamma \ge \rho_-(B)/\rho_+(B)$.

Note that we shall ignore $B \in \mathcal{B}$ such that $B \subset F^{(k-1)}$, and just let the corresponding gain be
0. Moreover, if there exists a base block $B \not\subset F^{(k-1)}$ but $c(B \cup F^{(k-1)}) \le c(F^{(k-1)})$, we can always
select $B$ and let $F^{(k)} = B \cup F^{(k-1)}$ (this is because it is always beneficial to add more features
into $F^{(k)}$ without additional coding complexity). We assume this step is always performed if such
a $B \in \mathcal{B}$ exists. The non-trivial case is $c(B \cup F^{(k-1)}) > c(F^{(k-1)})$ for all $B \in \mathcal{B}$; in this case both
$\phi(B)$ and $\tilde\phi(B)$ are well defined.
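
The structured greedy algorithm of Figure 1 can be sketched in a few lines for the least squares loss, using the projection-based gain of definition (3) with exact maximization over the blocks. This is a minimal illustration, not the authors' implementation. Its assumptions: the function name `struct_omp`, the use of $c(F^{(k)})$ as a stand-in for $c(\beta^{(k)})$ in the stopping test, the `max_iter` safeguard, and the toy group-sparse problem (16 groups of size 4 with the strong group coding complexity $|F| + g \log_2(2m)$).

```python
import math
import numpy as np

def struct_omp(X, y, blocks, c, s, max_iter=50):
    """Greedy block selection; c maps a set of indices to its coding complexity."""
    n, p = X.shape
    F, beta = set(), np.zeros(p)
    for _ in range(max_iter):
        r = X @ beta - y
        best_gain, best_block = 0.0, None
        for B in blocks:
            new = sorted(set(B) - F)
            if not new:
                continue  # B is already inside F: its gain is taken to be 0
            dc = c(F | set(B)) - c(F)
            XB = X[:, new]
            # Squared norm of the residual projected onto the new columns.
            coef, *_ = np.linalg.lstsq(XB, r, rcond=None)
            num = float(np.sum((XB @ coef) ** 2))
            # A block that adds no complexity is always worth selecting.
            gain = math.inf if dc <= 0 else num / dc
            if gain > best_gain:
                best_gain, best_block = gain, B
        if best_block is None:
            break
        F |= set(best_block)
        idx = sorted(F)
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        beta = np.zeros(p)
        beta[idx] = coef
        if c(F) > s:  # assumption: c(F) stands in for c(beta)
            break
    return beta, sorted(F)

# Toy group-sparse problem (hypothetical sizes and coefficients).
k0, m = 4, 16
p, n = k0 * m, 60
blocks = [list(range(g * k0, (g + 1) * k0)) for g in range(m)]

def c(F):
    # |F| + g*log2(2m); valid here because F is always a union of groups.
    return len(F) + len({j // k0 for j in F}) * math.log2(2 * m)

rng = np.random.default_rng(0)
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[8:12] = [1.5, -2.0, 1.0, 2.5]
beta_true[36:40] = [-1.0, 2.0, -2.5, 1.5]
y = X @ beta_true
beta_hat, F_hat = struct_omp(X, y, blocks, c, s=17)
```

Following Figure 1, the loop stops only after the complexity exceeds $s$, so the returned support has complexity 18 in this example even though $s = 17$.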



6 Theory of Structured Sparsity 

6.1 Assumptions 

We assume sub-Gaussian noise as follows.

Assumption 6.1 Assume that $\{y_i\}_{i=1,\ldots,n}$ are independent (but not necessarily identically
distributed) sub-Gaussians: there exists a constant $\sigma \ge 0$ such that $\forall i$ and $\forall t \in \mathbb{R}$,

$$\mathbb{E}_{y_i} e^{t(y_i - \mathbb{E}y_i)} \le e^{\sigma^2 t^2 / 2}.$$

Both Gaussian and bounded random variables are sub-Gaussian using the above definition. For
example, if a random variable $\xi \in [a, b]$, then $\mathbb{E}_\xi e^{t(\xi - \mathbb{E}\xi)} \le e^{(b-a)^2 t^2 / 8}$. If a random variable is
Gaussian: $\xi \sim N(0, \sigma^2)$, then $\mathbb{E}_\xi e^{t\xi} \le e^{\sigma^2 t^2 / 2}$.

The following property of sub-Gaussian noise is important in our analysis. Our simple proof 
yields a sub-optimal choice of the constants. 



Proposition 6.1 Let $P \in \mathbb{R}^{n \times n}$ be a projection matrix of rank $k$, and $y$ satisfies Assumption 6.1.
Then for all $\eta \in (0, 1)$, with probability larger than $1 - \eta$:

$$\|P(y - \mathbb{E}y)\|_2^2 \le \sigma^2 [7.4k + 2.7 \ln(2/\eta)].$$

We also need to generalize the sparse eigenvalue condition, used in the modern sparsity analysis.
It is related to (and weaker than) the RIP (restricted isometry property) assumption p] in the
compressive sensing literature. This definition takes advantage of coding complexity, and can be
also considered as (a weaker version of) structured RIP. We introduce a definition.

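Definition 6.1 For all $F \subset \{1, \ldots, p\}$, define

$$\rho_-(F) = \inf\left\{ \frac{1}{n} \|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\},$$

$$\rho_+(F) = \sup\left\{ \frac{1}{n} \|X\beta\|_2^2 / \|\beta\|_2^2 : \mathrm{supp}(\beta) \subset F \right\}.$$

Moreover, for all $s > 0$, define

$$\rho_-(s) = \inf\{\rho_-(F) : F \subset \mathcal{I},\ c(F) \le s\},$$
$$\rho_+(s) = \sup\{\rho_+(F) : F \subset \mathcal{I},\ c(F) \le s\}.$$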

In the theoretical analysis, we need to assume that $\rho_-(s)$ is not too small for some $s$ that is
larger than the signal complexity. Since we only consider eigenvalues for submatrices with small
cost $c(\beta)$, the sparse eigenvalue $\rho_-(s)$ can be significantly larger than the corresponding ratio for
standard sparsity (which will consider all subsets of $\{1, \ldots, p\}$ up to size $s$). For example, for random
projections used in compressive sensing applications, the coding length $c(\mathrm{supp}(\beta))$ is $O(k \ln p)$ in
standard sparsity, but can be as low as $c(\mathrm{supp}(\beta)) = O(k)$ in structured sparsity (if we can guess
$\mathrm{supp}(\beta)$ approximately correctly). Therefore instead of requiring $n = O(k \ln p)$ samples, we require
only $O(k + \mathrm{cl}(\mathrm{supp}(\beta)))$. The difference can be significant when $p$ is large and the coding length
$\mathrm{cl}(\mathrm{supp}(\beta)) \ll k \ln p$. An example for this is group sparsity, where we have $p/k_0$ even sized groups,
and variables in each group are simultaneously zero or nonzero. The coding length of the groups
is $(k/k_0) \ln(p/k_0)$, which is significantly smaller than $k \ln p$ when $p$ is large.

More precisely, we have the following random projection sample complexity bound for the
structured sparse eigenvalue condition. The theorem implies that the structured RIP condition is satisfied
with sample size $n = O((k/k_0) \ln(p/k_0))$ in group sparsity rather than $n = O(k \ln(p))$ in standard
sparsity. Therefore Theorem 6.2 shows that in the compressive sensing applications, it is possible
to reconstruct signals with fewer random projections by using group sparsity (or more
general structured sparsity).

Theorem 6.1 (Structured-RIP) Suppose that elements in $X$ are iid standard Gaussian random
variables $N(0, 1)$. For any $t > 0$ and $\delta \in (0, 1)$, let

\[ n \ge \frac{8}{\delta^2} \left[ \ln 3 + t + s \ln(1 + 8/\delta) \right]. \]

Then with probability at least $1 - e^{-t}$, the random matrix $X \in \mathbb{R}^{n \times p}$ satisfies the following structured-RIP
inequality for all vectors $\bar\beta \in \mathbb{R}^p$ with coding complexity no more than $s$:

\[ (1 - \delta)\|\bar\beta\|_2 \le \frac{1}{\sqrt{n}} \|X\bar\beta\|_2 \le (1 + \delta)\|\bar\beta\|_2. \qquad (4) \]



Although in the theorem we assume a Gaussian random matrix in order to state explicit constants,
it is clear that similar results hold for other sub-Gaussian random matrices.
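
The sample-size condition of Theorem 6.1 is easy to evaluate numerically. The sketch below simply computes the right-hand side $(8/\delta^2)[\ln 3 + t + s\ln(1 + 8/\delta)]$ and rounds up; the function name `structured_rip_sample_size` is a hypothetical helper introduced here for illustration.

```python
import math

def structured_rip_sample_size(s, delta, t):
    """Smallest integer n with n >= (8 / delta^2) * [ln 3 + t + s * ln(1 + 8 / delta)]."""
    if not (0.0 < delta < 1.0):
        raise ValueError("delta must lie in (0, 1)")
    if t <= 0.0:
        raise ValueError("t must be positive")
    bound = (8.0 / delta ** 2) * (math.log(3.0) + t + s * math.log(1.0 + 8.0 / delta))
    return math.ceil(bound)

# The bound grows linearly in the coding complexity s, and the failure
# probability of the structured-RIP inequality (4) is at most exp(-t).
for s in (10, 20, 40):
    print(s, structured_rip_sample_size(s, delta=0.5, t=3.0))
```

Because the bound depends on the signal only through its coding complexity $s$, any structure that lowers $s$ lowers the required number of random projections proportionally.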
6.2 Coding complexity regularization

The following result gives a performance bound for constrained coding complexity regularization. 



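Theorem 6.2 Suppose that Assumption 6.1 is valid. Consider any fixed target $\bar\beta \in \mathbb{R}^p$. Then with
probability exceeding $1 - \eta$, for all $\epsilon \ge 0$ and $\hat\beta \in \mathbb{R}^p$ such that $\hat{Q}(\hat\beta) \le \hat{Q}(\bar\beta) + \epsilon$, we have

\[ \|X\hat\beta - \mathbf{E}y\|_2 \le \|X\bar\beta - \mathbf{E}y\|_2 + \sigma\sqrt{2\ln(6/\eta)} + 2\left(7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta) + \epsilon\right)^{1/2}. \]

Moreover, if the coding scheme $c(\cdot)$ is sub-additive, then

\[ n\rho_-(c(\hat\beta) + c(\bar\beta)) \|\hat\beta - \bar\beta\|_2^2 \le 10\|X\bar\beta - \mathbf{E}y\|_2^2 + 37\sigma^2 c(\hat\beta) + 29\sigma^2 \ln(6/\eta) + 2.5\epsilon. \]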
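This theorem immediately implies the following result for (1): $\forall \bar\beta$ such that $c(\bar\beta) \le s$,

\[ \frac{1}{\sqrt{n}} \|X\hat\beta_{constr} - \mathbf{E}y\|_2 \le \frac{1}{\sqrt{n}} \|X\bar\beta - \mathbf{E}y\|_2 + \frac{\sigma}{\sqrt{n}} \sqrt{2\ln(6/\eta)} + \frac{2\sigma}{\sqrt{n}} \left(7.4s + 4.7\ln(6/\eta)\right)^{1/2}, \]

\[ \|\hat\beta_{constr} - \bar\beta\|_2^2 \le \frac{1}{\rho_-(s + c(\bar\beta))\, n} \left[ 10\|X\bar\beta - \mathbf{E}y\|_2^2 + 37\sigma^2 s + 29\sigma^2 \ln(6/\eta) \right]. \]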

Note that we generally expect $\rho_-(s + c(\bar\beta)) = O(1)$. The result immediately implies that as sample
size $n \to \infty$ and $s/n \to 0$, the root mean squared error prediction performance $\|X\hat\beta_{constr} - \mathbf{E}y\|_2/\sqrt{n}$
converges to the optimal prediction performance $\inf_{c(\bar\beta) \le s} \|X\bar\beta - \mathbf{E}y\|_2/\sqrt{n}$. This result is agnostic
in that even if $\|X\bar\beta - \mathbf{E}y\|_2/\sqrt{n}$ is large, the result is still meaningful because it says the performance
of the estimator $\hat\beta_{constr}$ is competitive to the best possible estimator in the class $c(\bar\beta) \le s$.

In compressive sensing applications, we take $\sigma = 0$, and we are interested in recovering $\bar\beta$
from random projections. For simplicity, we let $X\bar\beta = \mathbf{E}y = y$, and our result shows that the
constrained coding complexity penalization method achieves exact reconstruction $\hat\beta_{constr} = \bar\beta$ as
long as $\rho_-(2c(\bar\beta)) > 0$ (by setting $s = c(\bar\beta)$). According to Theorem 6.1, this is possible when
the number of random projections (sample size) reaches $n = O(c(\bar\beta))$. This is a generalization of
corresponding results in compressive sensing [B]. As we have pointed out earlier, this number can
be significantly smaller than the standard sparsity requirement of $n = O(\|\bar\beta\|_0 \ln p)$, if the structure
imposed in the formulation is meaningful.



Similar to Theorem 6.2, we can obtain the following result for (2). A related result for standard
sparsity under Gaussian noise can be found in [S].



Theorem 6.3 Suppose that Assumption 6.1 is valid. Consider any fixed target $\bar\beta \in \mathbb{R}^p$. Then with
probability exceeding $1 - \eta$, for all $\lambda > 7.4\sigma^2$ and $a \ge 7.4\sigma^2/(\lambda - 7.4\sigma^2)$, we have

\[ \|X\hat\beta_{pen} - \mathbf{E}y\|_2^2 \le (1 + a)^2 \|X\bar\beta - \mathbf{E}y\|_2^2 + (1 + a)\lambda c(\bar\beta) + \sigma^2 (10 + 5a + 7a^{-1}) \ln(6/\eta). \]

Unlike the result for (1), the prediction performance $\|X\hat\beta_{pen} - \mathbf{E}y\|_2$ of the estimator in (2) is
competitive to $(1 + a)\|X\bar\beta - \mathbf{E}y\|_2$, which is a constant factor larger than the optimal prediction
performance $\|X\bar\beta - \mathbf{E}y\|_2$. By optimizing $\lambda$ and $a$, it is possible to obtain a similar result as that of
Theorem 6.2. However, this requires tuning $\lambda$, which is not as convenient as tuning $s$ in (1). Note



that both results presented here, and those in are superior to the more traditional least squares 
regression results with $\lambda$ explicitly fixed (for example, theoretical results for AIC). This is because



one can only obtain the form presented in Theorem 6.2 by tuning $\lambda$. Such tuning is important in
real applications. 
6.3 Structured greedy algorithm

We shall introduce a definition before stating our main results. 
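Definition 6.2 Given $\mathcal{B} \subset 2^{\mathcal{I}}$, define

\[ \rho_0(\mathcal{B}) = \max_{B \in \mathcal{B}} \rho_+(B), \qquad c_0(\mathcal{B}) = \max_{B \in \mathcal{B}} c(B) \]

and

\[ c(\bar\beta, \mathcal{B}) = \min\left\{ \sum_{j=1}^{b} c(B_j) : \mathrm{supp}(\bar\beta) \subset \bigcup_{j=1}^{b} B_j \ (B_j \in \mathcal{B}) \right\}. \]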

The following theorem shows that if $c(\bar\beta, \mathcal{B})$ is small, then one can use the structured greedy algorithm
to find a coefficient vector $\beta^{(k)}$ that is competitive to $\bar\beta$, and the coding complexity $c(\beta^{(k)})$ is
not much worse than $c(\bar\beta, \mathcal{B})$. This implies that if the original coding complexity $c(\bar\beta)$ can
be approximated by the block complexity $c(\bar\beta, \mathcal{B})$, then we can approximately solve (1).



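Theorem 6.4 Suppose the coding scheme is sub-additive. Consider $\bar\beta$ and $\epsilon$ such that

\[ \epsilon \in \left(0, \|y\|_2^2 - \|X\bar\beta - y\|_2^2\right] \]

and

\[ s \ge \frac{\rho_0(\mathcal{B})\, c(\bar\beta, \mathcal{B})}{\gamma \rho_-(s + c(\bar\beta))} \ln \frac{\|y\|_2^2 - \|X\bar\beta - y\|_2^2}{\epsilon}. \]

Then at the stopping time $k$, we have

\[ \hat{Q}(\beta^{(k)}) \le \hat{Q}(\bar\beta) + \epsilon. \]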

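By Theorem 6.2, the result in Theorem 6.4 implies that

\[ \|X\beta^{(k)} - \mathbf{E}y\|_2 \le \|X\bar\beta - \mathbf{E}y\|_2 + \sigma\sqrt{2\ln(6/\eta)} + 2\sigma\left(7.4(s + c_0(\mathcal{B})) + 4.7\ln(6/\eta) + \epsilon/\sigma^2\right)^{1/2}, \]

\[ \|\beta^{(k)} - \bar\beta\|_2^2 \le \frac{1}{\rho_-(s + c_0(\mathcal{B}) + c(\bar\beta))\, n} \left[ 10\|X\bar\beta - \mathbf{E}y\|_2^2 + 37\sigma^2 (s + c_0(\mathcal{B})) + 29\sigma^2 \ln(6/\eta) + 2.5\epsilon \right]. \]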

The result shows that in order to approximate a signal $\bar\beta$ up to accuracy $\epsilon$, one needs to use
coding complexity $O(\ln(1/\epsilon))\, c(\bar\beta, \mathcal{B})$. If $\mathcal{B}$ contains small blocks and their sub-blocks with equal
coding length, and the coding scheme is block coding generated by $\mathcal{B}$, then $c(\bar\beta, \mathcal{B}) = c(\bar\beta)$. In this
case we need complexity $O(s \ln(1/\epsilon))$ to approximate a signal with coding complexity $s$.

In order to get rid of the $O(\ln(1/\epsilon))$ factor, backward greedy strategies can be employed, as
shown in various recent work such as [26]. For simplicity, we will not analyze such strategies in this
paper. However, in the following, we present an additional convergence result for the structured greedy
algorithm that can be applied to weakly sparse p-compressible signals common in practice. It is
shown that the $\ln(1/\epsilon)$ factor can be removed for such weakly sparse signals.
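
The role of the block set $\mathcal{B}$ and of the stopping complexity $s$ can be illustrated with a simplified greedy block selection loop. This is only a sketch under stated assumptions, not the paper's exact procedure: it scores each candidate block by the exact least squares decrease in $\hat{Q}$ per unit increase in coding complexity (a $\gamma = 1$ style selection), the `cost` callable standing in for $c(\cdot)$ is supplied by the caller, and all function names are hypothetical.

```python
import numpy as np

def least_squares_on_support(X, y, support):
    """Return beta minimizing ||X beta - y||_2^2 subject to supp(beta) inside `support`."""
    beta = np.zeros(X.shape[1])
    idx = sorted(support)
    if idx:
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        beta[idx] = coef
    return beta

def structured_greedy(X, y, blocks, cost, s):
    """Add blocks greedily until the coding complexity cost(F) reaches s.

    Each step picks the block with the largest ratio
    (decrease in squared error) / (increase in coding complexity).
    """
    support = set()
    beta = np.zeros(X.shape[1])
    q = float(np.sum((X @ beta - y) ** 2))
    while cost(support) < s:
        best = None
        for block in blocks:
            new_support = support | set(block)
            extra = cost(new_support) - cost(support)
            if extra <= 0:
                continue  # block adds no new complexity (already covered)
            cand = least_squares_on_support(X, y, new_support)
            q_cand = float(np.sum((X @ cand - y) ** 2))
            gain = (q - q_cand) / extra
            if best is None or gain > best[0]:
                best = (gain, new_support, cand, q_cand)
        if best is None or best[0] <= 1e-12:
            break  # no block yields a meaningful improvement
        _, support, beta, q = best
    return beta, support
```

For instance, with consecutive blocks of four indices and the assumed cost `len` (so that the complexity of a support is just its size), calling `structured_greedy(X, y, blocks, cost=len, s=8)` selects two blocks and refits the coefficients on their union.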

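Theorem 6.5 Suppose the coding scheme is sub-additive. Given a sequence of targets $\bar\beta_j$ such that
$\hat{Q}(\bar\beta_0) \le \hat{Q}(\bar\beta_1) \le \cdots$ and $c(\bar\beta_j, \mathcal{B}) \le c(\bar\beta_0, \mathcal{B})/2^j$. If

\[ s \ge \frac{\rho_0(\mathcal{B})\, c(\bar\beta_0, \mathcal{B})}{\gamma \min_j \rho_-(s + c(\bar\beta_j))} \left[ 3.4 + \sum_{j=0}^{\infty} 2^{-j} \ln \frac{\hat{Q}(\bar\beta_{j+1}) - \hat{Q}(\bar\beta_0) + \epsilon}{\hat{Q}(\bar\beta_j) - \hat{Q}(\bar\beta_0) + \epsilon} \right] \]

for some $\epsilon > 0$, then at the stopping time $k$, we have

\[ \hat{Q}(\beta^{(k)}) \le \hat{Q}(\bar\beta_0) + \epsilon. \]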

In the above theorem, we can see that if the signal is only weakly sparse, in that
$(\hat{Q}(\bar\beta_{j+1}) - \hat{Q}(\bar\beta_0) + \epsilon)/(\hat{Q}(\bar\beta_j) - \hat{Q}(\bar\beta_0) + \epsilon)$ grows sub-exponentially in $j$, then we can choose $s = O(c(\bar\beta_0, \mathcal{B}))$.
This means that we can find $\beta^{(k)}$ of complexity $s = O(c(\bar\beta_0, \mathcal{B}))$ to approximate a signal $\bar\beta_0$. The
worst case scenario is when $\hat{Q}(\bar\beta_1) \approx \hat{Q}(0)$, which reduces to the $s = O(c(\bar\beta_0, \mathcal{B}) \log(1/\epsilon))$ complexity
in Theorem 6.4.

As an application, we introduce the following concept of weakly sparse compressible target
that generalizes the corresponding concept of compressible signal in standard sparsity from the
compressive sensing literature [9].

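Definition 6.3 The target $\mathbf{E}y$ is $(a, q)$-compressible with respect to block $\mathcal{B}$ if there exist constants
$a, q > 0$ such that for each $s > 0$, $\exists \bar\beta(s)$ such that $c(\bar\beta(s), \mathcal{B}) \le s$ and

\[ \frac{1}{n} \|X\bar\beta(s) - \mathbf{E}y\|_2^2 \le a s^{-q}. \]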

Corollary 6.1 Suppose that the target is $(a, q)$-compressible with respect to $\mathcal{B}$. Then with probability
$1 - \eta$, at the stopping time $k$, we have

\[ \hat{Q}(\beta^{(k)}) \le \hat{Q}(\bar\beta(s')) + 2na/s'^q + 2\sigma^2 [\ln(2/\eta) + 1], \]

where 

If we assume the underlying coding scheme is block coding generated by $\mathcal{B}$, then
$\min_{u \le s'} \rho_-(s + c(\bar\beta(u))) \ge \rho_-(s + s')$. The corollary shows that we can approximate a compressible signal of
complexity $s'$ with complexity $s = O(q s')$ using the greedy algorithm. This means the greedy algorithm
obtains the optimal rate for weakly sparse compressible signals. The sample complexity suffers only a
constant factor $O(q)$. Combining this result with Theorem 6.2 and taking the union bound, we have with
probability $1 - 2\eta$, at stopping time $k$:



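\[ \frac{1}{\sqrt{n}} \|X\beta^{(k)} - \mathbf{E}y\|_2 \le \sqrt{\frac{a}{s'^q}} + \sigma\sqrt{\frac{2\ln(6/\eta)}{n}} + 2\sigma\sqrt{\frac{7.4(s + c_0(\mathcal{B})) + 6.7\ln(6/\eta)}{n} + \frac{2a}{\sigma^2 s'^q}}, \]

\[ \|\beta^{(k)} - \bar\beta(s')\|_2^2 \le \frac{1}{\rho_-(s + s' + c_0(\mathcal{B}))} \left[ \frac{15a}{s'^q} + \frac{37\sigma^2 (s + c_0(\mathcal{B})) + 34\sigma^2 \ln(6/\eta)}{n} \right]. \]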



Given a fixed $n$, we can obtain a convergence result by choosing $s$ (and thus $s'$) to optimize the right
hand side. The resulting rate is optimal for the special case of standard sparsity, which implies that
the bound has the optimal form for structured $q$-compressible targets. In particular, in compressive
sensing applications where $\sigma = 0$, we obtain that when the sample size reaches $n = O(q s')$, the reconstruction
performance is

\[ \|\beta^{(k)} - \bar\beta\|_2^2 = O(a/s'^q), \]

which matches that of the constrained coding complexity regularization method in (1) up to a
constant $O(q)$. Since many real data involve weakly sparse signals, our result provides strong
theoretical justification for the use of OMP in such problems. Our experiments are consistent with
the theory.
7 Experiments



The purpose of these experiments is to demonstrate the advantage of structured sparsity over
standard sparsity. We compare the proposed StructOMP to OMP and Lasso, which are standard
algorithms to achieve sparsity but without considering structure. In our experiments, we use
Lasso-modified least angle regression (LARS/Lasso) as the solver of Lasso [I]. In order to quantitatively
compare the performance of different algorithms, we use recovery error, defined as the relative difference
in 2-norm between the estimated sparse coefficient vector $\hat\beta_{est}$ and the ground-truth sparse coefficient
$\bar\beta$: $\|\hat\beta_{est} - \bar\beta\|_2 / \|\bar\beta\|_2$. Our experiments focus on graph sparsity, with several different underlying
graph structures. Note that graph sparsity is more general than group sparsity; in fact connected
regions may be regarded as dynamic groups that are not pre-defined. However, for illustration, we
include a comparison with group Lasso using some 1D simulated examples, where the underlying
structure can be more easily approximated by pre-defined groups. Since the more complicated
structures in the additional experiments are more difficult to approximate by pre-defined groups, we
exclude group Lasso in those experiments.
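
The recovery error defined above is straightforward to compute. The following sketch is one possible implementation; the function name `recovery_error` is an assumption made here for illustration.

```python
import numpy as np

def recovery_error(beta_est, beta_true):
    """Relative 2-norm difference ||beta_est - beta_true||_2 / ||beta_true||_2."""
    beta_est = np.asarray(beta_est, dtype=float)
    beta_true = np.asarray(beta_true, dtype=float)
    denom = np.linalg.norm(beta_true)
    if denom == 0.0:
        raise ValueError("ground-truth vector must be nonzero")
    return float(np.linalg.norm(beta_est - beta_true) / denom)
```

A value of 0 means exact recovery, while a value of 1 is what the all-zero estimate achieves, so errors near or above 1 indicate a failed reconstruction.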



7.1 Simulated 1D Signals with Line-Structured Sparsity

In the first experiment, we randomly generate a 1D structured sparse signal with values ±1, where
p = 512, k = 64 and g = 4. The support set of these signals is composed of g connected regions.
Here, each component of the sparse coefficient is connected to two of its adjacent components,
which forms the underlying graph structure. The graph sparsity concept introduced earlier is
used to compute the coding length of sparsity patterns in StructOMP. The projection matrix X
is generated by creating an n x p matrix with i.i.d. draws from a standard Gaussian distribution
N(0, 1). For simplicity, the rows of X are normalized to unit magnitude. Zero-mean Gaussian
noise with standard deviation $\sigma = 0.01$ is added to the measurements. Our task is to compare the
recovery performance of StructOMP to those of OMP, Lasso and group Lasso for these structured
sparsity signals.
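
The data generation described above can be sketched as follows. The text does not specify how the g connected regions are placed or sized, so this sketch assumes equal-length regions, one per equal segment of the index range; both function names are hypothetical.

```python
import numpy as np

def make_line_structured_signal(p=512, k=64, g=4, rng=None):
    """Signal with k nonzeros of value +1 or -1 arranged in g separate connected regions.

    Assumption: the regions have equal length k // g and each lies strictly
    inside its own segment of length p // g, so distinct regions never touch.
    """
    rng = np.random.default_rng() if rng is None else rng
    region_len = k // g
    segment = p // g
    beta = np.zeros(p)
    for r in range(g):
        start = r * segment + int(rng.integers(0, segment - region_len))
        beta[start:start + region_len] = rng.choice([-1.0, 1.0], size=region_len)
    return beta

def make_measurements(beta, n, sigma=0.01, rng=None):
    """Gaussian projection matrix with unit-norm rows, plus zero-mean Gaussian noise."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n, beta.size))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # normalize each row to unit magnitude
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y
```

Calling `make_measurements(make_line_structured_signal(), n=160)` reproduces the problem sizes used in the single-instance comparison, up to the placement assumption stated above.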

Figure 2 shows one instance of generated signal and the corresponding recovered results by
different algorithms when n = 160. Since the sample size n is not big enough, OMP and Lasso do
not achieve good recovery results, whereas the StructOMP algorithm achieves near perfect recovery
of the original signal. We also include group Lasso in this experiment for illustration. We use
pre-defined consecutive groups that do not completely overlap with the support of the signal. Since
we do not know the correct group size, we just try group Lasso with several different group sizes
(gs=2, 4, 8, 16). Although the results obtained with group Lasso are better than those of OMP
and Lasso, they are still inferior to the results with StructOMP. As mentioned, this is because the
pre-defined groups do not completely overlap with the support of the signal, which reduces the
efficiency. To study how the sample size n affects the recovery performance, we vary the sample
size and record the recovery results by different algorithms. To reduce the randomness, we perform
the experiment 100 times for each sample size. Figure 3(a) shows the recovery performance of the
three algorithms, averaged over 100 random runs for each sample size. As expected, StructOMP is
better than group Lasso and far better than OMP and Lasso. The results show that the
proposed StructOMP can achieve better recovery performance for structured sparsity signals with
fewer samples.

Figure 2: Recovery results of 1D signal with graph-structured sparsity. (a) original data; (b)
recovered results with OMP (error is 0.9921); (c) recovered results with Lasso (error is 0.8660); (d)
recovered results with Group Lasso (error is 0.4832 with group size gs=2); (e) recovered results with
Group Lasso (error is 0.4832 with group size gs=4); (f) recovered results with Group Lasso (error
is 0.2646 with group size gs=8); (g) recovered results with Group Lasso (error is 0.3980 with group
size gs=16); (h) recovered results with StructOMP (error is 0.0246).

Note that Lasso performs better than OMP in the first example. This is because the signal
is strongly sparse (that is, all nonzero coefficients are significantly different from zero). In the
second experiment, we randomly generate a 1D structured sparse signal with weak sparsity, where
the nonzero coefficients decay gradually to zero, but there is no clear cutoff. As expected in our
theory (see Theorem 6.5 and discussions thereafter), OMP becomes much more competitive relative
to Lasso. In fact OMP has better performance than Lasso in this more realistic situation. One
instance of generated signal is shown in Figure 4(a). Here, p = 512 and all coefficients of the
signal are nonzero. We define the sparsity k as the number of coefficients that contain 95% of
the signal energy. The support set of these signals is composed of g = 2 connected regions. Again,
each element of the sparse coefficient vector is connected to two of its adjacent elements, which
forms the underlying 1D line graph structure. The graph sparsity concept introduced earlier is
used to compute the coding length of sparsity patterns in StructOMP. The projection matrix X
is generated by creating an n x p matrix with i.i.d. draws from a standard Gaussian distribution
N(0, 1). For simplicity, the rows of X are normalized to unit magnitude. Zero-mean Gaussian noise
with standard deviation $\sigma = 0.01$ is added to the measurements.
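
The 95% energy definition of the sparsity k for a weakly sparse signal can be computed as in the sketch below. It assumes that "energy" means the sum of squared coefficients and that the largest coefficients are counted first; the function name `effective_sparsity` is hypothetical.

```python
import numpy as np

def effective_sparsity(beta, energy_fraction=0.95):
    """Smallest number of largest-magnitude coefficients holding the given energy fraction."""
    energy = np.sort(np.asarray(beta, dtype=float) ** 2)[::-1]  # squared magnitudes, descending
    total = energy.sum()
    if total == 0.0:
        return 0
    cumulative = np.cumsum(energy)
    # First position where the running energy reaches the target, converted to a count.
    count = int(np.searchsorted(cumulative, energy_fraction * total)) + 1
    return min(count, energy.size)
```

For a signal whose coefficients decay gradually, this count is much smaller than the number of nonzero entries, which is what makes the signal weakly rather than strongly sparse.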

Figure 4 shows one generated signal and its recovered results by different algorithms when k = 32
and n = 48. Again, we observe that OMP and Lasso do not achieve good recovery results, whereas
the StructOMP algorithm achieves near perfect recovery of the original signal. We try group Lasso
with several different group sizes (gs=2, 4, 8, 16). Although the results obtained with group Lasso
are better than those of OMP and Lasso, they are still inferior to the results with StructOMP. In
order to study how the sample size n affects the recovery performance, we vary the sample size
and record the recovery results by different algorithms. To reduce the randomness, we perform the
experiment 100 times for each of the sample sizes. Figure 3(b) shows the recovery performance of
different algorithms, averaged over 100 random runs for each sample size. As expected, the StructOMP
algorithm is superior in all cases. Unlike in the first experiment, the recovery
error of OMP becomes smaller than that of Lasso. This result is consistent with our theory, which
predicts that if the underlying signal is weakly sparse, then the relative performance of OMP
becomes comparable to Lasso.

7.2 2D Image Compressive Sensing with Tree-structured Sparsity 

It is well known that 2D natural images are sparse in a wavelet basis. Their wavelet coefficients have
a hierarchical tree structure, which is widely used for wavelet-based compression algorithms.

Figure 5(a) shows a widely used example image with size 64 x 64: cameraman. Each 2D wavelet
coefficient of this image is connected to its parent coefficient and child coefficients, which forms the
underlying hierarchical tree structure (which is a special case of graph sparsity). In our experiment,
we choose the Haar wavelet to obtain its tree-structured sparse wavelet coefficients. The projection
matrix X and noises are generated with the same method as that for 1D structured sparsity signals.
OMP, Lasso and StructOMP are used to recover the wavelet coefficients from the random projection
samples respectively. Then, the inverse wavelet transform is used to reconstruct the images
with these recovered wavelet coefficients. Our task is to compare the recovery performance of
StructOMP to those of OMP and Lasso.
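
For readers unfamiliar with the Haar transform, the sketch below implements a single level of the orthonormal 2D Haar decomposition and its inverse in NumPy. It is a minimal illustration that assumes even image dimensions; it is not the wavelet code used in the experiments, and the function names are hypothetical. Applying the forward step recursively to the low-pass band gives the multi-level coefficients, where the coefficient at position (i, j) of one scale is commonly taken as the parent of the four coefficients at positions (2i, 2j), (2i, 2j+1), (2i+1, 2j) and (2i+1, 2j+1) of the next finer scale.

```python
import numpy as np

def haar2d_level(img):
    """One level of the orthonormal 2D Haar transform (image sides must be even)."""
    a = np.asarray(img, dtype=float)
    r = np.sqrt(2.0)
    lo = (a[:, 0::2] + a[:, 1::2]) / r   # horizontal averages
    hi = (a[:, 0::2] - a[:, 1::2]) / r   # horizontal differences
    ll = (lo[0::2, :] + lo[1::2, :]) / r
    lh = (lo[0::2, :] - lo[1::2, :]) / r
    hl = (hi[0::2, :] + hi[1::2, :]) / r
    hh = (hi[0::2, :] - hi[1::2, :]) / r
    return ll, lh, hl, hh

def inverse_haar2d_level(ll, lh, hl, hh):
    """Invert haar2d_level exactly."""
    r = np.sqrt(2.0)
    h, w = ll.shape
    lo = np.empty((2 * h, w))
    hi = np.empty((2 * h, w))
    lo[0::2, :] = (ll + lh) / r
    lo[1::2, :] = (ll - lh) / r
    hi[0::2, :] = (hl + hh) / r
    hi[1::2, :] = (hl - hh) / r
    out = np.empty((2 * h, 2 * w))
    out[:, 0::2] = (lo + hi) / r
    out[:, 1::2] = (lo - hi) / r
    return out
```

Because the transform is orthonormal, the total energy of the four sub-bands equals that of the image, so sparsity in the wavelet domain translates directly into a compact description of the image.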

Figure 5 shows one example of the recovered results by different algorithms. It shows that
StructOMP obtains the best recovered result. Figure 6(a) shows the recovery performance of the
three algorithms, averaged over 100 random runs for each sample size. The StructOMP algorithm
is better than both Lasso and OMP in this case. Since real image data are weakly sparse, the
performance of standard OMP (without structured sparsity) is similar to that of Lasso.
Figure 4: Recovery results of 1D weakly sparse signal with line-structured sparsity. (a) original
data; (b) recovered results with OMP (error is 0.5599); (c) recovered results with Lasso (error is
0.6686); (d) recovered results with Group Lasso (error is 0.4732 with group size gs=2); (e) recovered
results with Group Lasso (error is 0.2893 with group size gs=4); (f) recovered results with Group
Lasso (error is 0.2646 with group size gs=8); (g) recovered results with Group Lasso (error is 0.5459
with group size gs=16); (h) recovered results with StructOMP (error is 0.0846).

Figure 5: Recovery results with sample size n = 2048: (a) the background subtracted image, (b)
recovered image with OMP (error is 0.21986), (c) recovered image with Lasso (error is 0.1670) and
(d) recovered image with StructOMP (error is 0.0375)

Figure 6: Recovery error vs. sample size: (a) 2D image with tree-structured sparsity in wavelet
basis; (b) background subtracted images with structured sparsity



7.3 Background Subtracted Images for Robust Surveillance 

Background subtracted images are typical structured sparsity data in static video surveillance
applications. They generally correspond to the foreground objects of interest. Unlike the whole
scene, these images are not only spatially sparse but also inclined to cluster into groups, which
correspond to different foreground objects. Thus, the StructOMP algorithm can obtain superior
recovery from compressive sensing measurements that are received by a centralized server from
multiple and randomly placed optical sensors. In this experiment, the testing video is downloaded
from http://homepages.inf.ed.ac.uk/rbf/CAVIARDATA1/. The background subtracted images are
obtained with the software [2E|. One sample image frame is shown in Figure 7(a). The support set
of 2D images is thus composed of several connected regions. Here, each pixel of the 2D background
subtracted image is connected to four of its adjacent pixels, forming the underlying graph structure
in graph sparsity. The results shown in Figure 7 demonstrate that StructOMP outperforms
both OMP and Lasso in recovery. We randomly choose 100 background subtracted images as test
images. Figure 6(b) shows the recovery performance as a function of increasing sample sizes. It
demonstrates again that StructOMP significantly outperforms OMP and Lasso in recovery performance
on video data. Compared to the image compression example in the previous section, the
background subtracted images have a more clearly defined sparsity pattern where nonzero coefficients
are generally distinct from zero (that is, stronger sparsity); this explains why Lasso performs
better than the standard (unstructured) OMP on this particular data. The result is again consistent
with our theory.
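
The statement that the support is composed of several connected regions under the four-neighbor pixel graph can be checked on any binary support mask. The sketch below counts such regions with a breadth-first search; it is an illustrative helper with a hypothetical name, not part of the experimental pipeline.

```python
from collections import deque

import numpy as np

def count_connected_regions(mask):
    """Count 4-connected regions of True pixels in a 2D boolean mask."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask, dtype=bool)
    rows, cols = mask.shape
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c] or seen[r, c]:
                continue
            regions += 1  # found an unvisited foreground pixel: start a new region
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                i, j = queue.popleft()
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and mask[ni, nj] and not seen[ni, nj]):
                        seen[ni, nj] = True
                        queue.append((ni, nj))
    return regions
```

A support with few regions relative to its size is exactly the kind of pattern that graph sparsity encodes with a short coding length.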



Figure 7: Recovery results with sample size n = 900: (a) the background subtracted image, (b)
recovered image with OMP (error is 1.1833), (c) recovered image with Lasso (error is 0.7075) and
(d) recovered image with StructOMP (error is 0.1203)

8 Discussion

This paper develops a theory for structured sparsity where prior knowledge allows us to prefer
certain sparsity patterns to others. Some examples are presented to illustrate the concept. The
general framework established in this paper includes the recently popularized group sparsity idea
as a special case.

In structured sparsity, the complexity of learning is measured by the coding complexity
$c(\bar\beta) \le \|\bar\beta\|_0 + \mathrm{cl}(\mathrm{supp}(\bar\beta))$ instead of $\|\bar\beta\|_0 \ln p$, which determines the complexity in standard sparsity. Using
this notation, a theory parallel to that of the standard sparsity is developed. The theory shows that if
the coding length $\mathrm{cl}(\mathrm{supp}(\bar\beta))$ is small for a target coefficient vector $\bar\beta$, then the complexity of learning
$\bar\beta$ can be significantly smaller than the corresponding complexity in standard sparsity. Experimental
results demonstrate that significant improvements can be obtained on some real problems that have
natural structures.

The structured greedy algorithm presented in this paper is the first efficient algorithm proposed
to handle general structured sparsity learning. It is shown that the algorithm is effective under
appropriate conditions. Future work includes additional computationally efficient methods such as
convex relaxation methods (e.g. $L_1$ regularization for standard sparsity, and group Lasso for strong
group sparsity) and backward greedy strategies to improve the forward greedy method considered
in this paper.
A Proof of Proposition 6.1

Lemma A.1 Consider a fixed vector $x \in \mathbb{R}^n$, and a random vector $y \in \mathbb{R}^n$ with independent
sub-Gaussian components: $\mathbf{E} e^{t(y_i - \mathbf{E}y_i)} \le e^{\sigma^2 t^2/2}$ for all $t$ and $i$. Then $\forall \epsilon > 0$:

\[ \Pr\left( |x^\top y - \mathbf{E} x^\top y| \ge \epsilon \right) \le 2 e^{-\epsilon^2/(2\sigma^2 \|x\|_2^2)}. \]

Proof Let $s_n = \sum_{i=1}^{n} (x_i y_i - \mathbf{E} x_i y_i)$; then by assumption, $\mathbf{E}(e^{t s_n} + e^{-t s_n}) \le 2 e^{\sum_i x_i^2 \sigma^2 t^2/2}$, which
implies that $\Pr(|s_n| \ge \epsilon) e^{t\epsilon} \le 2 e^{\sum_i x_i^2 \sigma^2 t^2/2}$. Now let $t = \epsilon/(\sum_i x_i^2 \sigma^2)$, and we obtain the desired bound. ■



The following lemma is taken from [2U]. Since the proof is simple, it is included for completeness. 



Lemma A.2 Consider the unit sphere $S^{k-1} = \{x : \|x\|_2 = 1\}$ in $\mathbb{R}^k$ ($k \ge 1$). Given any $\epsilon > 0$,
there exists an $\epsilon$-cover $Q \subset S^{k-1}$ such that $\min_{q \in Q} \|x - q\|_2 \le \epsilon$ for all $\|x\|_2 = 1$, with
$|Q| \le (1 + 2/\epsilon)^k$.

Proof Let $B^k = \{x : \|x\|_2 \le 1\}$ be the unit ball in $\mathbb{R}^k$. Let $Q = \{q_i\}_{i=1,\ldots,|Q|} \subset S^{k-1}$ be a maximal
subset such that $\|q_i - q_j\|_2 > \epsilon$ for all $i \ne j$. By maximality, $Q$ is an $\epsilon$-cover of $S^{k-1}$. Since the
balls $q_i + (\epsilon/2) B^k$ are disjoint and belong to $(1 + \epsilon/2) B^k$, we have

\[ \sum_{i \le |Q|} \mathrm{vol}(q_i + (\epsilon/2) B^k) \le \mathrm{vol}((1 + \epsilon/2) B^k). \]

Therefore,

\[ |Q| (\epsilon/2)^k \mathrm{vol}(B^k) \le (1 + \epsilon/2)^k \mathrm{vol}(B^k), \]

which implies that $|Q| \le (1 + 2/\epsilon)^k$. ■
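Proof of Proposition 6.1

According to Lemma A.2, given $\epsilon_1 > 0$, there exists a finite set $Q = \{q_i\}$ with $|Q| \le (1 + 2/\epsilon_1)^k$
such that $\|P q_i\|_2 = 1$ for all $i$, and $\min_i \|P\beta - P q_i\|_2 \le \epsilon_1$ for all $\|P\beta\|_2 = 1$.
For each $i$, Lemma A.1 implies that $\forall \epsilon_2 > 0$:

\[ \Pr\left( |q_i^\top P(y - \mathbf{E}y)| \ge \epsilon_2 \right) \le 2 e^{-\epsilon_2^2/(2\sigma^2)}. \]

Taking the union bound over all $q_i \in Q$, we obtain with probability exceeding $1 - 2(1 + 2/\epsilon_1)^k e^{-\epsilon_2^2/(2\sigma^2)}$:

\[ |q_i^\top P(y - \mathbf{E}y)| \le \epsilon_2 \]

for all $i$.

Let $\beta = P(y - \mathbf{E}y)/\|P(y - \mathbf{E}y)\|_2$; then there exists $i$ such that $\|P\beta - P q_i\|_2 \le \epsilon_1$. We have

\[ \begin{aligned}
\|P(y - \mathbf{E}y)\|_2 &= \beta^\top (y - \mathbf{E}y) \\
&\le \|P\beta - P q_i\|_2 \|P(y - \mathbf{E}y)\|_2 + |q_i^\top P(y - \mathbf{E}y)| \\
&\le \epsilon_1 \|P(y - \mathbf{E}y)\|_2 + \epsilon_2.
\end{aligned} \]

Therefore

\[ \|P(y - \mathbf{E}y)\|_2 \le \epsilon_2/(1 - \epsilon_1). \]

Let $\epsilon_1 = 2/15$ and $\eta = 2(1 + 2/\epsilon_1)^k e^{-\epsilon_2^2/(2\sigma^2)}$. We have

\[ \epsilon_2^2 = 2\sigma^2 [(4k + 1)\ln 2 - \ln \eta], \]

and thus

\[ \|P(y - \mathbf{E}y)\|_2 \le \frac{15}{13} \sigma \sqrt{2(4k + 1)\ln 2 - 2\ln \eta}. \]

This simplifies to the desired bound.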
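
The final simplification can be checked numerically. Squaring the last display gives $(15/13)^2 \cdot 2\sigma^2[(4k+1)\ln 2 + \ln(1/\eta)]$, and the sketch below verifies that each resulting coefficient is dominated by the corresponding coefficient of $\sigma^2[7.4k + 2.7\ln(2/\eta)]$. The variable names are introduced here for illustration only.

```python
import math

eps1 = 2.0 / 15.0
cover_base = 1.0 + 2.0 / eps1        # equals 16, so |Q| <= 16**k = 2**(4k)
scale = 1.0 / (1.0 - eps1) ** 2      # (15/13)**2 from squaring eps2 / (1 - eps1)

coef_k = scale * 2.0 * 4.0 * math.log(2.0)   # multiplies sigma^2 * k
coef_const = scale * 2.0 * math.log(2.0)     # constant multiple of sigma^2
coef_log = scale * 2.0                       # multiplies sigma^2 * ln(1/eta)

# Target bound: 7.4 k + 2.7 ln 2 + 2.7 ln(1/eta), since ln(2/eta) = ln 2 + ln(1/eta).
print(round(coef_k, 3), round(coef_const, 3), round(coef_log, 3))
print(coef_k <= 7.4, coef_const <= 2.7 * math.log(2.0), coef_log <= 2.7)
```

All three comparisons hold, which is why the constants 7.4 and 2.7 appear in Proposition 6.1.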



B Proof of Theorem 6.1



We use the following lemma from [TB"] . 

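Lemma B.1 Suppose $X$ is generated according to Theorem 6.1. For any fixed set $F \subset \mathcal{I}$ with
$|F| = k$ and $0 < \delta < 1$, we have with probability exceeding $1 - 3(1 + 8/\delta)^k e^{-n\delta^2/8}$:

\[ (1 - \delta)\|\beta\|_2 \le \frac{1}{\sqrt{n}} \|X_F \beta\|_2 \le (1 + \delta)\|\beta\|_2 \qquad (5) \]

for all $\beta \in \mathbb{R}^k$.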
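Proof of Theorem 6.1

Since $\mathrm{cl}(F)$ is a coding length, we have

\[ \begin{aligned}
\sum_{F : |F| + \mathrm{cl}(F) \le s} (1 + 8/\delta)^{|F|} &\le \sum_{F : |F| + \gamma\, \mathrm{cl}(F) \le s} (1 + 8/\delta)^{|F|} \\
&\le \sum_{F} (1 + 8/\delta)^{s - \gamma\, \mathrm{cl}(F)} = (1 + 8/\delta)^s \sum_{F} 2^{-\mathrm{cl}(F)} \le (1 + 8/\delta)^s,
\end{aligned} \]

where we let $\gamma = 1/\log_2(1 + 8/\delta)$ in the above derivation.

For each $F$, we know from Lemma B.1 that for all $\beta$ such that $\mathrm{supp}(\beta) \subset F$:

\[ (1 - \delta)\|\beta\|_2 \le \frac{1}{\sqrt{n}} \|X\beta\|_2 \le (1 + \delta)\|\beta\|_2 \]

with probability exceeding $1 - 3(1 + 8/\delta)^{|F|} e^{-n\delta^2/8}$.

We can thus take the union bound over $F : |F| + \mathrm{cl}(F) \le s$, which shows that with probability
exceeding

\[ 1 - \sum_{F : |F| + \mathrm{cl}(F) \le s} 3(1 + 8/\delta)^{|F|} e^{-n\delta^2/8}, \]

the structured RIP in Equation (4) holds. Since

\[ \sum_{F : |F| + \mathrm{cl}(F) \le s} 3(1 + 8/\delta)^{|F|} e^{-n\delta^2/8} \le 3(1 + 8/\delta)^s e^{-n\delta^2/8} \le e^{-t}, \]

we obtain the desired bound.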



C Proof of Theorem 6.2 and Theorem 6.3 



Lemma C.1 Suppose that Assumption 6.1 is valid. For any fixed subset $F \subset \mathcal{I}$, with
probability $1 - \eta$, for all $\beta$ such that $\mathrm{supp}(\beta) \subset F$ and all $a > 0$, we have

\[ \|X\beta - \mathbf{E}y\|_2^2 \le (1 + a)\left[\|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2\right] + (2 + a + a^{-1})\sigma^2 [7.4|F| + 4.7\ln(4/\eta)]. \]



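Proof Let

\[ P_F = X_F (X_F^\top X_F)^{-1} X_F^\top \]

be the projection matrix onto the subspace generated by the columns of $X_F$.

Let $\tilde{a} = (I - P_F)\mathbf{E}y/\|(I - P_F)\mathbf{E}y\|_2$, $\delta_1 = \|P_F(y - \mathbf{E}y)\|_2$ and $\delta_2 = |\tilde{a}^\top (y - \mathbf{E}y)|$. We have

\[ \begin{aligned}
\|X\beta - \mathbf{E}y\|_2^2
&= \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2(y - \mathbf{E}y)^\top (X\beta - \mathbf{E}y) \\
&= \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2(y - \mathbf{E}y)^\top (X\beta - P_F \mathbf{E}y) - 2\tilde{a}^\top (y - \mathbf{E}y) \|(I - P_F)\mathbf{E}y\|_2 \\
&= \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2(y - \mathbf{E}y)^\top P_F (X\beta - P_F \mathbf{E}y) - 2\tilde{a}^\top (y - \mathbf{E}y) \|(I - P_F)\mathbf{E}y\|_2 \\
&\le \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2\delta_1 \|X\beta - P_F \mathbf{E}y\|_2 + 2\delta_2 \|(I - P_F)\mathbf{E}y\|_2 \\
&\le \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2\sqrt{\delta_1^2 + \delta_2^2} \sqrt{\|X\beta - P_F \mathbf{E}y\|_2^2 + \|(I - P_F)\mathbf{E}y\|_2^2} \\
&= \|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + 2\sqrt{\delta_1^2 + \delta_2^2}\, \|X\beta - \mathbf{E}y\|_2.
\end{aligned} \]

Note that in the above derivation, we have used the fact that $P_F X\beta = X\beta$, and
$\|X\beta - P_F \mathbf{E}y\|_2^2 + \|(I - P_F)\mathbf{E}y\|_2^2 = \|X\beta - \mathbf{E}y\|_2^2$.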

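Now, by solving the above inequality, we obtain

\[ \begin{aligned}
\|X\beta - \mathbf{E}y\|_2^2 &\le \left[ \sqrt{\|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + \delta_1^2 + \delta_2^2} + \sqrt{\delta_1^2 + \delta_2^2} \right]^2 \\
&\le (1 + a)\left[\|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2\right] + (2 + a + 1/a)(\delta_1^2 + \delta_2^2).
\end{aligned} \]

The desired bound now follows easily from Proposition 6.1 and Lemma A.1, where we know that
with probability $1 - \eta/2$,

\[ \delta_1^2 = (y - \mathbf{E}y)^\top P_F (y - \mathbf{E}y) \le \sigma^2 (7.4|F| + 2.7\ln(4/\eta)), \]

and with probability $1 - \eta/2$,

\[ \delta_2^2 = |\tilde{a}^\top (y - \mathbf{E}y)|^2 \le 2\sigma^2 \ln(4/\eta). \]

We obtain the desired result by substituting the above two estimates and simplifying. ■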



Lemma C.2 Suppose that Assumption 6.1 is valid. Then we have with probability $1 - \eta$, $\forall \beta \in \mathbb{R}^p$
and $a > 0$:

\[ \|X\beta - \mathbf{E}y\|_2^2 \le (1 + a)\left[\|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2\right] + (2 + a + 1/a)\sigma^2 [7.4c(\beta) + 4.7\ln(4/\eta)]. \]

Proof Note that for each $F$, with probability $1 - 2^{-\mathrm{cl}(F)}\eta$, we obtain from Lemma C.1 that for all $\beta$ with $\mathrm{supp}(\beta) \subset F$,

\[ \|X\beta - \mathbf{E}y\|_2^2 \le (1 + a)\left[\|X\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2\right] + (2 + a + 1/a)\sigma^2 [7.4(|F| + \mathrm{cl}(F)) + 4.7\ln(4/\eta)]. \]

Since $\sum_{F \subset \mathcal{I}} 2^{-\mathrm{cl}(F)} \eta \le \eta$, the result follows from the union bound. ■

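Lemma C.3 Consider a fixed subset $F \subset \mathcal{I}$. Given any $\eta \in (0, 1)$, we have with probability $1 - \eta$:

\[ \left| \|X\bar\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 \right| \le \|X\bar\beta - \mathbf{E}y\|_2^2 + 2\sigma\sqrt{2\ln(2/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2. \]

Proof Let $\tilde{a} = (X\bar\beta - \mathbf{E}y)/\|X\bar\beta - \mathbf{E}y\|_2$. We have

\[ \begin{aligned}
\left| \|X\bar\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 \right|
&= \left| -2(X\bar\beta - \mathbf{E}y)^\top (y - \mathbf{E}y) + \|X\bar\beta - \mathbf{E}y\|_2^2 \right| \\
&\le 2\|X\bar\beta - \mathbf{E}y\|_2\, |\tilde{a}^\top (y - \mathbf{E}y)| + \|\mathbf{E}y - X\bar\beta\|_2^2.
\end{aligned} \]

The desired result now follows from Lemma A.1. ■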
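Lemma C.4 Suppose that Assumption 6.1 is valid. Consider any fixed target $\bar\beta \in \mathbb{R}^p$. Then with
probability exceeding $1 - \eta$, for all $\lambda \ge 0$, $\epsilon \ge 0$, $\hat\beta \in \mathbb{R}^p$ such that $\hat{Q}(\hat\beta) + \lambda c(\hat\beta) \le \hat{Q}(\bar\beta) + \lambda c(\bar\beta) + \epsilon$,
and for all $a > 0$, we have

\[ \begin{aligned}
\|X\hat\beta - \mathbf{E}y\|_2^2 \le{} & (1 + a)\left[\|X\bar\beta - \mathbf{E}y\|_2^2 + 2\sigma\sqrt{2\ln(6/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2\right] \\
& + (1 + a)\lambda c(\bar\beta) + a' c(\hat\beta) + b' \ln(6/\eta) + (1 + a)\epsilon,
\end{aligned} \]

where $a' = 7.4(2 + a + a^{-1})\sigma^2 - (1 + a)\lambda$ and $b' = 4.7\sigma^2 (2 + a + a^{-1})$. Moreover, if the coding scheme
$c(\cdot)$ is sub-additive, then

\[ n\rho_-(c(\hat\beta) + c(\bar\beta)) \|\hat\beta - \bar\beta\|_2^2 \le 10\|X\bar\beta - \mathbf{E}y\|_2^2 + 2.5\lambda c(\bar\beta) + (37\sigma^2 - 2.5\lambda) c(\hat\beta) + 29\sigma^2 \ln(6/\eta) + 2.5\epsilon. \]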



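Proof We obtain from the union bound of Lemma C.2 (with probability $1 - 2\eta/3$) and Lemma C.3
(with probability $1 - \eta/3$) that with probability $1 - \eta$:

\[ \begin{aligned}
\|X\hat\beta - \mathbf{E}y\|_2^2
&\le (1 + a)\left[\|X\hat\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2\right] + (2 + a + a^{-1})[7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta)] \\
&\le (1 + a)\left[\|X\bar\beta - y\|_2^2 - \|y - \mathbf{E}y\|_2^2 + \lambda c(\bar\beta) + \epsilon\right] + a' c(\hat\beta) + b' \ln(6/\eta) \\
&\le (1 + a)\left[\|X\bar\beta - \mathbf{E}y\|_2^2 + 2\sigma\sqrt{2\ln(6/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2\right] + (1 + a)\lambda c(\bar\beta) + a' c(\hat\beta) \\
&\quad + b' \ln(6/\eta) + (1 + a)\epsilon.
\end{aligned} \]

This proves the first claim of the lemma.
The first claim with $a = 1$ implies that

\[ \begin{aligned}
\|X\hat\beta - X\bar\beta\|_2^2
&\le \left[\|X\hat\beta - \mathbf{E}y\|_2 + \|X\bar\beta - \mathbf{E}y\|_2\right]^2 \\
&\le 1.25\|X\hat\beta - \mathbf{E}y\|_2^2 + 5\|X\bar\beta - \mathbf{E}y\|_2^2 \\
&\le 7.5\|X\bar\beta - \mathbf{E}y\|_2^2 + 5\sigma\sqrt{2\ln(6/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2 + 2.5\lambda c(\bar\beta) + 1.25(29.6\sigma^2 - 2\lambda) c(\hat\beta) \\
&\quad + 1.25 \times 18.8\sigma^2 \ln(6/\eta) + 2.5\epsilon \\
&\le 10\|X\bar\beta - \mathbf{E}y\|_2^2 + 2.5\lambda c(\bar\beta) + (37\sigma^2 - 2.5\lambda) c(\hat\beta) + 29\sigma^2 \ln(6/\eta) + 2.5\epsilon.
\end{aligned} \]

Since $c(\hat\beta - \bar\beta) \le c(\hat\beta) + c(\bar\beta)$, we have $\|X\hat\beta - X\bar\beta\|_2^2 \ge n\rho_-(c(\hat\beta) + c(\bar\beta)) \|\hat\beta - \bar\beta\|_2^2$. This implies the
second claim. ■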



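Proof of Theorem 6.2

We take $\lambda = 0$ in Lemma C.4 and obtain:

\[ \begin{aligned}
\|X\hat\beta - \mathbf{E}y\|_2^2
&\le (1 + a)\left[\|X\bar\beta - \mathbf{E}y\|_2^2 + 2\sigma\sqrt{2\ln(6/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2\right] \\
&\quad + 7.4(2 + a + a^{-1})\sigma^2 c(\hat\beta) + 4.7\sigma^2 (2 + a + a^{-1}) \ln(6/\eta) + (1 + a)\epsilon \\
&= \left(\|X\bar\beta - \mathbf{E}y\|_2 + \sigma\sqrt{2\ln(6/\eta)}\right)^2 + 14.8\sigma^2 c(\hat\beta) + 7.4\sigma^2 \ln(6/\eta) + \epsilon \\
&\quad + a\left[\left(\|X\bar\beta - \mathbf{E}y\|_2 + \sigma\sqrt{2\ln(6/\eta)}\right)^2 + 7.4\sigma^2 c(\hat\beta) + 2.7\sigma^2 \ln(6/\eta) + \epsilon\right] \\
&\quad + a^{-1}\left[7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta)\right].
\end{aligned} \]

Now let $z = \|X\bar\beta - \mathbf{E}y\|_2 + \sigma\sqrt{2\ln(6/\eta)}$, and we choose $a$ to minimize the right hand side as:

\[ \begin{aligned}
\|X\hat\beta - \mathbf{E}y\|_2^2
&\le z^2 + 14.8\sigma^2 c(\hat\beta) + 7.4\sigma^2 \ln(6/\eta) + \epsilon \\
&\quad + 2\left[z^2 + 7.4\sigma^2 c(\hat\beta) + 2.7\sigma^2 \ln(6/\eta) + \epsilon\right]^{1/2} \left[7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta)\right]^{1/2} \\
&\le \left[\left(z^2 + 7.4\sigma^2 c(\hat\beta) + 2.7\sigma^2 \ln(6/\eta) + \epsilon\right)^{1/2} + \left(7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta)\right)^{1/2}\right]^2 \\
&\le \left[z + 2\left(7.4\sigma^2 c(\hat\beta) + 4.7\sigma^2 \ln(6/\eta) + \epsilon\right)^{1/2}\right]^2.
\end{aligned} \]

This proves the first inequality. The second inequality follows directly from Lemma C.4 with $\lambda = 0$.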



Proof of Theorem 6.3

The desired bound is a direct consequence of Lemma C.4, by noticing that

\[ 2\sigma\sqrt{2\ln(6/\eta)}\, \|X\bar\beta - \mathbf{E}y\|_2 \le a\|X\bar\beta - \mathbf{E}y\|_2^2 + a^{-1} 2\sigma^2 \ln(6/\eta), \]

$a' \le 0$, and

6' + a^cr 2 < (10 + 5a + 7a" 1 )a 2 . 



D Proof of Theorem 6.4 and Theorem 6.5



The following lemma is an adaptation of a similar result in [26] on greedy algorithms for standard
sparsity.

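Lemma D.1 Suppose the coding scheme is sub-additive. Consider any $\bar\beta$, and a cover of $\bar\beta$ by $\mathcal{B}$:

\[
\mathrm{supp}(\bar\beta) \subset \bar F = \cup_{j=1}^{b} \bar B_j \quad (\bar B_j \in \mathcal{B}).
\]

Let $c(\bar\beta, \mathcal{B}) = \sum_{j=1}^{b} c(\bar B_j)$. Let $\rho_0 = \max_j \rho_+(\bar B_j)$. Then for all $F$ such that $c(\bar B_j \cup F) > c(F)$,

\[
\beta = \arg\min_{\beta'} \|X\beta' - y\|_2^2 \quad \text{subject to } \mathrm{supp}(\beta') \subset F,
\]

and $\|X\beta - y\|_2^2 \ge \|X\bar\beta - y\|_2^2$, we have

\[
\max_j \phi(\bar B_j) \ge \frac{\rho_-(F \cup \bar F)}{\rho_0\, c(\bar\beta, \mathcal{B})}\left[\|X\beta - y\|_2^2 - \|X\bar\beta - y\|_2^2\right],
\]

where, as in the main text, we define

\[
\phi(B) = \frac{\|P_{B-F}(X\beta - y)\|_2^2}{c(B \cup F) - c(F)}. \qquad ■
\]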

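Proof For all $\ell \in F$, $\|X\beta + \alpha X e_\ell - y\|_2^2$ achieves the minimum at $\alpha = 0$ (where $e_\ell$ is the vector
of zeros except for the $\ell$-th component, which is one). This implies that

\[
x_\ell^\top (X\beta - y) = 0
\]

for all $\ell \in F$. Therefore we have

\[
\begin{aligned}
(X\beta - y)^\top \sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)\, x_\ell
={}& (X\beta - y)^\top \sum_{\ell \in \bar F \cup F} (\bar\beta_\ell - \beta_\ell)\, x_\ell = (X\beta - y)^\top (X\bar\beta - X\beta) \\
={}& -\tfrac{1}{2}\|X(\bar\beta - \beta)\|_2^2 + \tfrac{1}{2}\|X\bar\beta - y\|_2^2 - \tfrac{1}{2}\|X\beta - y\|_2^2.
\end{aligned}
\]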



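Now, let $B'_j \subset \bar B_j - F$ be disjoint sets such that $\cup_j B'_j = \bar F - F$. The above equality leads to the
following derivation $\forall \eta > 0$:

\[
\begin{aligned}
-\sum_j \phi(\bar B_j)\bigl(c(\bar B_j \cup F) - c(F)\bigr)
\le{}& \sum_j \Bigl[\bigl\|X\beta + \eta \sum_{\ell \in B'_j} (\bar\beta_\ell - \beta_\ell)\, x_\ell - y\bigr\|_2^2 - \|X\beta - y\|_2^2\Bigr] \\
\le{}& \eta^2 \sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)^2 \rho_0 n + 2\eta (X\beta - y)^\top \sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)\, x_\ell \\
\le{}& \eta^2 \sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)^2 \rho_0 n - \eta\|X(\bar\beta - \beta)\|_2^2 + \eta\|X\bar\beta - y\|_2^2 - \eta\|X\beta - y\|_2^2.
\end{aligned}
\]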



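Note that we have used the fact that $\|P_{B-F}(X\beta - y)\|_2^2 \ge \|X\beta - y\|_2^2 - \|X\beta - y + X\Delta\beta\|_2^2$ for all
$\Delta\beta$ such that $\mathrm{supp}(\Delta\beta) \subset B - F$. By optimizing over $\eta$, we obtain

\[
\begin{aligned}
\max_j \phi(\bar B_j) \sum_j c(\bar B_j) \ge{}& \sum_j \phi(\bar B_j)\bigl(c(\bar B_j \cup F) - c(F)\bigr) \\
\ge{}& \frac{\left[\|X(\bar\beta - \beta)\|_2^2 + \|X\beta - y\|_2^2 - \|X\bar\beta - y\|_2^2\right]^2}{4\sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)^2 \rho_0 n} \\
\ge{}& \frac{4\|X(\bar\beta - \beta)\|_2^2\left[\|X\beta - y\|_2^2 - \|X\bar\beta - y\|_2^2\right]}{4\sum_{\ell \in \bar F - F} (\bar\beta_\ell - \beta_\ell)^2 \rho_0 n} \\
\ge{}& \frac{\rho_-(F \cup \bar F)}{\rho_0}\left[\|X\beta - y\|_2^2 - \|X\bar\beta - y\|_2^2\right].
\end{aligned}
\]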



This leads to the desired bound. 
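The step "optimizing over $\eta$" and the passage to the next line of the chain rest on two elementary facts: a quadratic $\eta^2 A - \eta B$ with $A > 0$ attains its minimum $-B^2/(4A)$ at $\eta = B/(2A)$, and $(u + v)^2 \ge 4uv$. The following is a minimal numerical sanity check of those two facts only, not of the lemma itself; the names `A`, `B`, `u`, `v` and `min_over_eta` are hypothetical shorthand introduced for this sketch.

```python
import random

def min_over_eta(A, B):
    """Return (eta_star, value) minimizing eta**2 * A - eta * B for A > 0."""
    eta_star = B / (2.0 * A)
    return eta_star, eta_star ** 2 * A - eta_star * B

random.seed(0)
checks = []
for _ in range(1000):
    # A stands in for sum_l (bar beta_l - beta_l)^2 * rho_0 * n  (assumed shorthand)
    A = random.uniform(0.1, 10.0)
    # u stands in for ||X(bar beta - beta)||^2, v for ||X beta - y||^2 - ||X bar beta - y||^2
    u = random.uniform(0.0, 10.0)
    v = random.uniform(0.0, 10.0)
    B = u + v

    eta_star, value = min_over_eta(A, B)
    closed_form = -B ** 2 / (4.0 * A)

    # The minimum equals -B^2 / (4A), and perturbing eta never does better.
    ok_min = abs(value - closed_form) < 1e-9 and all(
        (eta_star + d) ** 2 * A - (eta_star + d) * B >= value - 1e-12
        for d in (-0.5, -0.01, 0.01, 0.5)
    )
    # (u + v)^2 >= 4uv, used to drop the square in the numerator.
    ok_amgm = B ** 2 >= 4.0 * u * v - 1e-12

    checks.append(ok_min and ok_amgm)

all_ok = all(checks)
```

Here `all_ok` summarizes the 1000 random trials; the tolerances only absorb floating-point rounding.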



Proof of Theorem 6.4

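Let

\[
\gamma' = \frac{\gamma\,\rho_-(s + c(\bar\beta))}{\rho_0(\mathcal{B})\, c(\bar\beta, \mathcal{B})}.
\]

By Lemma D.1, we have at any step $k > 0$:

\[
\|X\beta^{(k-1)} - y\|_2^2 - \|X\beta^{(k)} - y\|_2^2 \ge \gamma'\left[\|X\beta^{(k-1)} - y\|_2^2 - \|X\bar\beta - y\|_2^2\right]\bigl(c(\beta^{(k)}) - c(\beta^{(k-1)})\bigr),
\]

which implies that

\[
\max\left[0, \|X\beta^{(k)} - y\|_2^2 - \|X\bar\beta - y\|_2^2\right] \le \max\left[0, \|X\beta^{(k-1)} - y\|_2^2 - \|X\bar\beta - y\|_2^2\right] e^{-\gamma'(c(\beta^{(k)}) - c(\beta^{(k-1)}))}.
\]

Therefore at stopping, we have

\[
\begin{aligned}
\|X\beta^{(k)} - y\|_2^2 - \|X\bar\beta - y\|_2^2
\le{}& \left[\|y\|_2^2 - \|X\bar\beta - y\|_2^2\right] e^{-\gamma' c(\beta^{(k)})} \\
\le{}& \left[\|y\|_2^2 - \|X\bar\beta - y\|_2^2\right] e^{-\gamma' s} \le \epsilon.
\end{aligned}
\]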

This proves the theorem. 
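The passage from the one-step inequality to the exponential bound uses the elementary fact $1 - x \le e^{-x}$. A short worked version follows; the shorthand $r_k$ and $\Delta_k$ is introduced here only for illustration, and the final line assumes that the algorithm starts from $\beta^{(0)} = 0$ with $c(\beta^{(0)}) = 0$, which this appendix does not restate.

```latex
% Shorthand (assumed): r_k      = \|X\beta^{(k)} - y\|_2^2 - \|X\bar\beta - y\|_2^2,
%                      \Delta_k = c(\beta^{(k)}) - c(\beta^{(k-1)}).
\[
r_{k-1} - r_k \ge \gamma' r_{k-1}\Delta_k
\;\Longrightarrow\;
r_k \le (1 - \gamma'\Delta_k)\, r_{k-1} \le e^{-\gamma'\Delta_k}\, r_{k-1}
\qquad (r_{k-1} \ge 0).
\]
% Telescoping over steps 1, ..., k:
\[
r_k \le r_0 \exp\Bigl(-\gamma' \sum_{i=1}^{k} \Delta_i\Bigr)
    = \bigl[\|y\|_2^2 - \|X\bar\beta - y\|_2^2\bigr]\, e^{-\gamma' c(\beta^{(k)})}.
\]
```

The $\max[0, \cdot]$ in the proof covers the remaining case $r_{k-1} < 0$, where the residual has already dropped below that of the target and there is nothing left to show.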
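Proof of Theorem 6.5

For simplicity, let $f_j = \hat Q(\bar\beta_j)$. For each $k$, let $j_k$ be the largest $j$ such that

\[
\hat Q(\beta^{(k)}) \ge f_j + f_j - f_0 + \epsilon.
\]

Let $\gamma' = (\gamma \min_j \rho_-(s + c(\bar\beta_j)))/(\rho_0(\mathcal{B})\, c(\bar\beta_0, \mathcal{B}))$.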

We prove by contradiction. Suppose that the theorem does not hold; then for all $k$ before
stopping, we have $j_k \ge 0$.

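For each $k > 0$ before stopping, if $j_k = j_{k-1} = j$, then we have from Lemma D.1 (with $\bar\beta = \bar\beta_j$)

\[
c(\beta^{(k)}) \le c(\beta^{(k-1)}) + \gamma'^{-1} 2^{-j} \ln\frac{\|X\beta^{(k-1)} - y\|_2^2 - f_j}{\|X\beta^{(k)} - y\|_2^2 - f_j}.
\]

Therefore for each $j \ge 0$, we have:

\[
\sum_{k:\, j_k = j_{k-1} = j} \left[c(\beta^{(k)}) - c(\beta^{(k-1)})\right] \le \gamma'^{-1} 2^{-j} \ln\frac{2(f_{j+1} - f_0 + \epsilon)}{f_j - f_0 + \epsilon}.
\]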

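Moreover, for each $j \ge 0$, Lemma D.1 (with $\bar\beta = \bar\beta_j$) implies that

\[
\sum_{k:\, j_k = j,\, j_{k-1} > j} \left[c(\beta^{(k)}) - c(\beta^{(k-1)})\right] \le \gamma'^{-1} 2^{-j}.
\]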

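Therefore we have

\[
\sum_{k:\, j_k = j} \left[c(\beta^{(k)}) - c(\beta^{(k-1)})\right] \le \gamma'^{-1} 2^{-j}\left[1.7 + \ln\frac{f_{j+1} - f_0 + \epsilon}{f_j - f_0 + \epsilon}\right].
\]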

Now by summing over $j \ge 0$, we have

\[
c(\beta^{(k)}) \le 3.4\gamma'^{-1} + \gamma'^{-1} \sum_{j=0}^{\infty} 2^{-j} \ln\frac{f_{j+1} - f_0 + \epsilon}{f_j - f_0 + \epsilon} \le s.
\]

This is a contradiction because we know at stopping, we should have $c(\beta^{(k)}) > s$.
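The constants 1.7 and 3.4 in the proof above can be checked by simple arithmetic. The sketch below assumes, as a reading of the argument rather than something the text states explicitly, that 1.7 is a rounding up of $1 + \ln 2$ (the $\ln 2$ coming from the factor 2 inside the logarithm, and the 1 from the step on which the index drops to $j$), and that 3.4 is 1.7 summed against the geometric weights $2^{-j}$. All variable names are hypothetical.

```python
import math

# Assumed reading of the constant: ln(2 * r) = ln 2 + ln r, plus one extra unit
# contributed by the single step on which the level index drops to j.
per_level_constant = 1.0 + math.log(2.0)

# Partial sums of the geometric weights 2^{-j} approach 2.
geometric_weight = sum(2.0 ** (-j) for j in range(60))

# Summing the rounded constant 1.7 over all levels gives the 3.4 in the bound.
summed_constant = 1.7 * geometric_weight
```

Under this reading, `per_level_constant` stays below 1.7 and `summed_constant` agrees with 3.4 up to floating-point rounding.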
E Proof of Corollary 6.1



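Given $s'$, we consider $f_j = \min_{\ell \ge j} \hat Q(\bar\beta(s'/2^{\ell}))$. We may assume that $f_0$ is achieved with $\ell_0 = 0$.
Note that by Lemma C.3, we have with probability $1 - 2^{-j-1}\eta$:

\[
\begin{aligned}
\left|\hat Q(\bar\beta(s'/2^{j})) - \|y - \mathbf{E}y\|_2^2\right| \le{}& 2\|X\bar\beta(s'/2^{j}) - \mathbf{E}y\|_2^2 + 2\sigma^2\left[j + 1 + \ln(2/\eta)\right] \\
\le{}& 2an2^{qj}/s'^{q} + 2\sigma^2\left[j + 1 + \ln(2/\eta)\right].
\end{aligned}
\]

This means the above inequality holds for all $j$ with probability $1 - \eta$. Now, by taking $\epsilon = 2an/s'^{q} +
2\sigma^2[\ln(2/\eta) + 1]$ in Theorem 6.5, we obtain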



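\[
\begin{aligned}
\sum_{j=0}^{\infty} 2^{-j} \ln\frac{f_{j+1} - f_0 + \epsilon}{f_j - f_0 + \epsilon}
\le{}& \sum_{j=0}^{\infty} 2^{-j} \ln\bigl(1 + (f_{j+1} - f_0)/\epsilon\bigr) \\
\le{}& \sum_{j=0}^{\infty} 2^{-j} \ln\bigl(2 + 2(j + 2^{q(j+1)})\bigr) \\
\le{}& \sum_{j=0}^{\infty} 2^{-j} \bigl(\ln 2 + 1 + j + q(j+1)\ln 2\bigr) \le 2 + 4(1 + q\ln 2),
\end{aligned}
\]

where we have used the simple inequality $\ln(\alpha + \beta) \le \alpha + \ln(\beta)$ when $\alpha, \beta \ge 1$. Therefore,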

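\[
\begin{aligned}
s \ge{}& \frac{\rho_0(\mathcal{B})\, s'}{\gamma \min_{u \le s'} \rho_-(s + c(\bar\beta(u)))}\,(10 + 3q) \\
\ge{}& \frac{\rho_0(\mathcal{B})\, s'}{\gamma \min_{u \le s'} \rho_-(s + c(\bar\beta(u)))}\left[3.4 + \sum_{j=0}^{\infty} 2^{-j} \ln\frac{f_{j+1} - f_0 + \epsilon}{f_j - f_0 + \epsilon}\right].
\end{aligned}
\]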



This means that Theorem 6.5 can be applied to obtain the desired bound. 
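The numerical constants in this proof can be verified directly. The sketch below sums the series $\sum_j 2^{-j}(\ln 2 + 1 + j + q(j+1)\ln 2)$, compares it with the stated bound $2 + 4(1 + q\ln 2)$, and checks that adding 3.4 stays below $10 + 3q$. It is only a sanity check under illustrative values of $q$; the function names and the list `q_values` are hypothetical and not part of the paper.

```python
import math

LN2 = math.log(2.0)

def series_partial_sum(q, terms=200):
    """Partial sum of sum_{j>=0} 2^{-j} * (ln 2 + 1 + j + q*(j+1)*ln 2)."""
    return sum(
        2.0 ** (-j) * (LN2 + 1.0 + j + q * (j + 1) * LN2)
        for j in range(terms)
    )

def series_closed_form(q):
    """Uses sum 2^{-j} = 2, sum j*2^{-j} = 2 and sum (j+1)*2^{-j} = 4."""
    return 2.0 * LN2 + 4.0 + 4.0 * q * LN2

def log_inequality_holds(alpha, beta):
    """ln(alpha + beta) <= alpha + ln(beta), used in the text for alpha, beta >= 1."""
    return math.log(alpha + beta) <= alpha + math.log(beta) + 1e-12

q_values = [0.0, 0.5, 1.0, 2.0, 5.0]  # illustrative exponents only
results = []
for q in q_values:
    total = series_closed_form(q)
    stated = 2.0 + 4.0 * (1.0 + q * LN2)  # bound quoted in the proof
    # Each tuple: (q, exact series value, stated bound, 3.4 + stated, 10 + 3q)
    results.append((q, total, stated, 3.4 + stated, 10.0 + 3.0 * q))
```

Each entry of `results` lists, for one value of $q$, the exact series value, the stated bound, and the two sides of the final comparison $3.4 + 2 + 4(1 + q\ln 2) \le 10 + 3q$.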